The DokuWiki Parser

This document explains the details of the DokuWiki parser and is intended for developers who want to modify the parser's behaviour or gain control over the output document, perhaps modifying the generated HTML or implementing different output formats.

Overview

The parser breaks down the process of transforming a raw DokuWiki document to the final output document (normally XHTML) into discrete stages. Each stage is represented by one or more PHP classes.

Broadly these elements are;

Lexer¹⁾: scans ²⁾ from a DokuWiki document and outputs a sequence of “tokens” ³⁾, corresponding to the syntax in the document
Handler⁴⁾: receives the tokens from the Lexer and transforms them into a sequence of “instructions” ⁵⁾. It describes how the document should be rendered, from start to finish, without the Renderer needing to keep track of state.
Parser⁶⁾: “connects up” the Lexer with the Handler, providing the DokuWiki syntax rules as well as the point of access to the system (the Parser::parse() method)
Renderer⁷⁾: accepts the instructions from the Handler and “draws” the document ready for viewing (e.g. as XHTML)

No mechanism is provided for connecting up the Handler with the Renderer - this needs coding per specific use case.

A rough diagram of the relationships between these components;

                                                               
      +-----------+          +-----------+
      |           |  Input   |  Client   |
      |  Parser   |<---------|   Code    |
      |           |  String  |           |
      +-----.-----+          +-----|-----+
    Modes   |                     /|\
      +     |             Renderer |
    Input   |          Instructions|
    String \|/                     |
      +-----'-----+          +-----------+
      |           |          |           |
      |  Lexer    |--------->|  Handler  |
      |           |  Tokens  |           |
      +-----.-----+          +-----------+
            |
            |
       +----+---+
       | Modes  |-+
       +--------+ |-+
         +--------+ |
           +--------+

The “Client Code” (code using the Parser) invokes the Parser, giving it the input string. It receives, in return, the list of “Renderer Instructions”, built by the Handler. These can then be fed to some object which implements the Renderer.

Note: A critical point behind this design is the intent to allow the Renderer to be as “dumb” as possible. It should not need to make further interpretation / modification of the instructions it is given but be purely concerned with rendering some kind of output (e.g. XHTML) - in particular the Renderer should not need to keep track of state. By keeping to this principle, aside from making Renderers easy to implement (the focus being purely on what to output), it will also make it possible for Renderers to be interchangeable (e.g. output PDF as alternative to XHTML). At the same time, the instructions output from the Handler are geared for rendering XHTML and may not be entirely suited for all output formats.

Lexer

Defined in the folder inc/Parsing/Lexer

In the most general sense, it provides a tool for managing complex regular expressions, where state is important. The Lexer comes from Simple Test but contains three modifications (read: hacks);

support for lookback and lookahead patterns
support for changing the pattern modifiers from within the pattern
notifying the Handler of the starting byte index in the raw text, where a token was matched

In short, Simple Test’s lexer acts as a tool to make regular expressions easy to manage - rather than giant regexes you write many small / simple ones. The lexer takes care of combining them efficiently then gives you a SAX-like callback API to allow you to write code to respond to matched “events”.

The Lexer as a whole is made of three main classes;

dokuwiki\Lexer\ParallelRegex: allows regular expressions to be constructed from multiple, separate patterns, each pattern being associated with an identifying “label” , the class combining them into a single regex using subpatterns. When using the Lexer, you should not need to worry about this class.
dokuwiki\Lexer\StateStack: provides a simple state machine so that lexing can be “context aware”. When using the Lexer, you should not need to worry about this class.
dokuwiki\Lexer\Lexer: provides the point of access for client code using the Lexer. Manages multiple instances of ParallelRegex, using the StateStack to apply the correct ParallelRegex instance, depending on “context”. On encountering “interesting text”, it calls functions on a user provided object (the Handler).

The need for state

The wiki syntax used in DokuWiki contains markup, “inside” of which only certain syntax rules apply. The most obvious example is the <code/> tag, inside of which no other wiki syntax should be recognized by the Lexer. Other syntax, such as the list or table syntax should allow some markup but not others e.g. you can use links in a list context but not tables.

The Lexer provides “state awareness” allowing it to apply the correct syntax rules depending on its current position (the context) in the text it's scanning. If it sees an opening <code> tag, it should switch to a different state within which no other syntax rules apply (i.e. anything that would normally look like wiki syntax should be treated as “dumb” text) until it finds the close </code> tag.

Lexer Modes

The term mode is a label for a particular lexing state⁸⁾. The code using the Lexer registers one or more regex patterns with a particular named mode. Then, as the Lexer matches those patterns against the text it is scanning, it calls functions on the Handler with the same name as the mode (unless the mapHandler method was used to create an alias - see below).

The Lexer API

A short introduction to the Lexer can be found at Simple Test Lexer Notes. This provides more detail.

The key methods in the Lexer are;

Constructor

Accepts an object reference to the Handler, a name of the initial mode that the Lexer should start in and (optionally) a boolean flag as to whether pattern matching should be case sensitive.

Example;

$handler = new MyHandler();
$lexer = new dokuwiki\Lexer\Lexer($handler, 'base', true);

Here the initial mode here is called 'base'.

addEntryPattern / addExitPattern

addEntryPattern() and addExitPattern() are used to register a pattern for entering and exiting a particular parsing mode. For example;

// arg0: regex to match - note no need to add start/end pattern delimiters
// arg1: name of mode where this entry pattern may be used
// arg2: name of mode to enter
$lexer->addEntryPattern('<file>','base','file');
 
// arg0: regex to match
// arg1: name of mode to exit
$lexer->addExitPattern('</file>','file');

The above would allow the <file/> tag to be used from the base mode to enter a new mode (called file). If further modes should be applied while the Lexer is inside the file mode, these would need to be registered with the file mode.

Note: there's no need to use pattern start and end delimiters.

addPattern

addPattern() is used to trigger additional “tokens” inside an existing mode (no transitions). It accepts a pattern and the name of a mode it should be used inside.

This is best seen from considering the list syntax in the parser. Lists syntax looks like this in DokuWiki;

Before the list
  * Unordered List Item
  * Unordered List Item
  * Unordered List Item
After the list

Using addPattern() it becomes possible to match the complete list at once while still exiting correctly and tokenizing each list item;

// Match the opening list item and change mode
$lexer->addEntryPattern('\n {2,}[\*]','base','list');
 
// Match new list items but stay in the list mode
$lexer->addPattern('\n {2,}[\*]','list');
 
// If it's a linefeed that fails to match the above addPattern rule, exit the mode
$lexer->addExitPattern('\n','list');

addSpecialPattern

addSpecialPattern() is used to enter a new mode just for the match then drop straight back into the “parent” mode. Accepts a pattern, a name of a mode it can be applied inside and the name of the “temporary” mode to enter for the match. Typically this would be used if you want to substitute wiki markup with something else. For example to match a smiley like :-) you might have;

$lexer->addSpecialPattern(':-)','base','smiley');

mapHandler

mapHandler() allows a particular named mode to be mapped onto a method with a different name in the Handler. This may be useful when differing syntax should be handled in the same manner, such as the DokuWiki syntax for disabling other syntax inside a particular text block;

$lexer->addEntryPattern('<nowiki>','base', 'unformatted');
$lexer->addEntryPattern('%%','base', 'unformattedalt');
$lexer->addExitPattern('</nowiki>', 'unformatted');
$lexer->addExitPattern('%%', 'unformattedalt');
 
// Both syntaxes should be handled the same way...
$lexer->mapHandler('unformattedalt', 'unformatted');

Subpatterns Not Allowed

Because the Lexer itself uses subpatterns (inside the ParallelRegex class), code using the Lexer cannot. This may take some getting used to but, generally, the addPattern() method can be applied for solving the types problems where subpatterns are typically applied. It has the advantage of keeping regexs simpler and thereby easier to manage.

Syntax Errors and State

To prevent “badly formed” (in particular a missing closing tag) markup causing the Lexer to enter a state (mode) which it never leaves, it can be useful to use a lookahead pattern to check for the closing markup first⁹⁾. For example;

// Use lookahead in entry pattern...
$lexer->addEntryPattern('<file>(?=.*</file>)','base','file');
$lexer->addExitPattern('</file>','file');

The entry pattern checks it can find a closing </file> tag before it enters the state.

Handler

Defined in inc/parser/handler.php and the folder inc/Parsing/Handler

The Handler is a class providing methods which are called by the Lexer as it matches tokens. It then “fine tunes” the tokens into a sequence of instructions ready for a Renderer.

The Handler as a whole contains the following classes:

Doku_Handler: all calls from the Lexer are made to this class. For every mode registered with the Lexer, there will be a corresponding method in the Handler
dokuwiki\Parsing\Handler\CallWriter: provides a layer between the array of instructions (the Doku_Handler::$calls array) and the Handler methods writing the instructions. It will be temporarily replaced with other objects, such as dokuwiki\Parsing\Handler\List, while lexing is in progress.
dokuwiki\Parsing\Handler\Lists: responsible for transforming list tokens into instructions while lexing is still in progress
dokuwiki\Parsing\Handler\Quote: responsible for transforming blockquote tokens (text beginning with one or more >) into instructions while lexing is still in progress
dokuwiki\Parsing\Handler\Table: responsible for transforming table tokens into instructions while lexing is still in progress
dokuwiki\Parsing\Handler\Block: responsible for inserting 'p_open' and 'p_close' instructions, while being aware of 'block level' instructions, once all lexing has finished (i.e. it loops once through the complete list of instructions and inserts more instructions)
dokuwiki\Parsing\Handler\AbstractRewriter: …extended by Preformatted and Nest …
dokuwiki\Parsing\Handler\Nest: …
dokuwiki\Parsing\Handler\Preformatted: responsible for transforming preformatted tokens (indented text in dokuwiki) into instructions while lexing is still in progress

Interfaces:

dokuwiki\Parsing\Handler\CallWriterInterface: … implemented by CallWriter
dokuwiki\Parsing\Handler\ReWriterInterface: … extends the CallWriterInterface, implemented by AbstractRewriter

Handler Token Methods

The Handler must provide methods named corresponding to the modes registered with the Lexer (bearing in mind the Lexer mapHandler() method - see above).

For example if you registered a file mode with the Lexer like;

$lexer->addEntryPattern('<file>(?=.*</file>)','base','file');
$lexer->addExitPattern('</file>','file');

The Handler will need a method like;

class Doku_Handler {
 
    /**
    * @param string match contains the text that was matched
    * @param int state - the type of match made (see below)
    * @param int pos - byte index where match was made
    */
    public function file($match, $state, $pos) {
        return true;
    }
}

Note: a Handler method must return true or the Lexer will halt immediately. This behaviour can be useful when dealing with other types of parsing problem but for the DokuWiki parser, all Handler methods will always return true.

The arguments provided to a handler method are;

$match: the text that was matched
$state: this is a constant which describes how exactly the match was made;
1. DOKU_LEXER_ENTER: matched an entry pattern (see Lexer::addEntryPattern)
2. DOKU_LEXER_MATCHED: matched a pattern (see Lexer::addPattern)
3. DOKU_LEXER_UNMATCHED: some text found inside the mode which matched no patterns
4. DOKU_LEXER_EXIT: matched an exit pattern (see Lexer::addExitPattern)
5. DOKU_LEXER_SPECIAL: matched a special pattern (see Lexer::addSpecialPattern)
$pos: this is the byte index (strlen from start) where the start of the token was found. $pos + strlen($match) should give the byte index of the end of the match

As a more complex example, in the Parser the following is defined for matching lists;

public function connectTo($mode) {
    $this->Lexer->addEntryPattern('\n {2,}[\-\*]', $mode, 'listblock');
    $this->Lexer->addEntryPattern('\n\t{1,}[\-\*]', $mode, 'listblock');
 
    $this->Lexer->addPattern('\n {2,}[\-\*]', 'listblock');
    $this->Lexer->addPattern('\n\t{1,}[\-\*]', 'listblock');
}
 
public function postConnect() {
    $this->Lexer->addExitPattern('\n', 'listblock');
}

The listblock method in the Handler ¹⁰⁾, looks like;

public function listblock($match, $state, $pos) {
    switch ($state) {
        // The start of the list...
        case DOKU_LEXER_ENTER:
            // Create the List rewriter, passing in the current CallWriter
            $reWriter = new dokuwiki\Parsing\Handler\Lists($this->callWriter);
 
            // Replace the current CallWriter with the List rewriter
            // all incoming tokens (even if not list tokens)
            // are now diverted to the list
            $this->callWriter = $reWriter;
 
            $this->addCall('list_open', [$match], $pos);
            break;
 
        // The end of the list
        case DOKU_LEXER_EXIT:
            $this->addCall('list_close', [], $pos);
 
            // Tell the List rewriter to clean up
            $this->callWriter->process();
 
            // Restore the old CallWriter
            $reWriter = $this->callWriter;
            $this->callWriter = $reWriter->callWriter;
            break;
 
        case DOKU_LEXER_MATCHED:
            $this->addCall('list_item', [$match], $pos);
            break;
 
        case DOKU_LEXER_UNMATCHED:
            $this->addCall('cdata', [$match], $pos);
            break;
    }
    return true;
}

Token Conversion

Part of the fine tuning, performed by the handler, involves inserting / renaming or removing tokens provided by the Lexer.

For example, a list like;

This is not a list
  * This is the opening list item
  * This is the second list item
  * This is the last list item
This is also not a list

Would result in sequence of tokens something like;

base: "This is not a list", DOKU_LEXER_UNMATCHED
listblock: "\n *", DOKU_LEXER_ENTER
listblock: " This is the opening list item", DOKU_LEXER_UNMATCHED
listblock: "\n *", DOKU_LEXER_MATCHED
listblock: " This is the second list item", DOKU_LEXER_UNMATCHED
listblock: "\n *", DOKU_LEXER_MATCHED
listblock: " This is the last list item", DOKU_LEXER_UNMATCHED
listblock: "\n", DOKU_LEXER_EXIT
base: "This is also not a list", DOKU_LEXER_UNMATCHED

But to be useful to the Renderer, this has to be converted to the following instructions;

p_open:
cdata: "This is not a list"
p_close:
listu_open:
listitem_open:
cdata: " This is the opening list item"
listitem_close:
listitem_open:
cdata: " This is the second list item"
listitem_close:
listitem_open:
cdata: " This is the last list item"
listitem_close:
list_close:
p_open:
cdata: "This is also not a list"
p_close:

In the case of lists, this requires the help of the dokuwiki\Parsing\Handler\Lists class, which has its own knowledge of state and is captures the incoming tokens, replacing them with the correct instructions for a Renderer.

Parser

Defined in inc\Parsing\Parser.php and inc/parser/parser.php.

The dokuwiki\Parsing\Parser acts as the front end to external code and sets up the Lexer with the patterns and modes describing DokuWiki syntax.

Using the Parser will generally look like:

// Create the Handler
$handler = new Doku_Handler();
 
// Create the parser with the handler
$parser = new dokuwiki\Parsing\Parser($handler);
 
// Add required syntax modes to parser
$parser->addMode('footnote', new dokuwiki\Parsing\ParserMode\Footnote());
$parser->addMode('hr', new dokuwiki\Parsing\ParserMode\Hr());
$parser->addMode('unformatted', new dokuwiki\Parsing\ParserMode\Unformatted());
# etc.

$doc = file_get_contents('wikipage.txt.');
$instructions = $parser->parse($doc);

More detailed examples are below.

As a whole, the Parser also contains classes representing each available syntax mode, the base class for all of these being dokuwiki\Parsing\ParserMode\AbstractMode. The behaviour of these modes are best understood by looking at the examples of adding syntax later in this document.

The reason for representing the modes with classes is to avoid repeated calls to the Lexer methods. Without them it would be necessary to hard code each pattern rule for every mode that pattern can be matched in, for example, registering a single pattern rule for the CamelCase link syntax would require something like;

$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'base', 'camelcaselink');
$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'footnote', 'camelcaselink');
$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'table', 'camelcaselink');
$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'listblock', 'camelcaselink');
$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'strong', 'camelcaselink');
$lexer->addSpecialPattern('\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', 'underline', 'camelcaselink');
// etc.

Each mode that is allowed to contain CamelCase links must be explicitly named.

Rather than hard coding this, instead it is implemented using a single class like;

namespace dokuwiki\Parsing\ParserMode;
 
class CamelCaseLink extends AbstractMode {
 
    public function connectTo($mode) {
        $this->Lexer->addSpecialPattern(
            '\b[A-Z]+[a-z]+[A-Z][A-Za-z]*\b', $mode, 'camelcaselink'
        );
    }
}

When setting up the Lexer, the Parser calls the connectTo() method on the dokuwiki\Parsing\ParserMode\CamelCaseLink object for every other mode which accepts the CamelCase syntax (some don't such as the <code /> syntax).

At the expense of making the Lexer setup harder to understand, this allows the code to be more flexible when adding new syntax.

Instructions Data Format

Parserarray plugin is a export renderer that show the instructions for the current page. It may help you to understand the data format. The following shows an example of raw wiki text and the corresponding output from the parser;

The raw text (contains a table);

abc
| Row 0 Col 1    | Row 0 Col 2     | Row 0 Col 3        |
| Row 1 Col 1    | Row 1 Col 2     | Row 1 Col 3        |
def

When parsed the following PHP array is returned (described below);

Array(
    [0] => Array(
            [0] => document_start
            [1] => Array()
            [2] => 0
        )
    [1] => Array(
            [0] => p_open
            [1] => Array()
            [2] => 0
        )
    [2] => Array(
            [0] => cdata
            [1] => Array(
                    [0] => 

abc
                )
            [2] => 0
        )
    [3] => Array(
            [0] => p_close
            [1] => Array()
            [2] => 5
        )
    [4] => Array(
            [0] => table_open
            [1] => Array(
                    [0] => 3
                    [1] => 2
                )
            [2] => 5
        )
    [5] => Array(
            [0] => tablerow_open
            [1] => Array()
            [2] => 5
        )
    [6] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 5
        )
    [7] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 0 Col 1
                )
            [2] => 7
        )
    [8] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>     
                )
            [2] => 19
        )
    [9] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 23
        )
    [10] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 23
        )
    [11] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 0 Col 2
                )
            [2] => 24
        )
    [12] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>      
                )
            [2] => 36
        )
    [13] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 41
        )
    [14] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 41
        )
    [15] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 0 Col 3
                )
            [2] => 42
        )
    [16] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>         
                )
            [2] => 54
        )
    [17] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 62
        )
    [18] => Array(
            [0] => tablerow_close
            [1] => Array()
            [2] => 63
        )
    [19] => Array(
            [0] => tablerow_open
            [1] => Array()
            [2] => 63
        )
    [20] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 63
        )
    [21] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 1 Col 1
                )
            [2] => 65
        )
    [22] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>     
                )
            [2] => 77
        )
    [23] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 81
        )
    [24] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 81
        )
    [25] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 1 Col 2
                )
            [2] => 82
        )
    [26] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>      
                )
            [2] => 94
        )
    [27] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 99
        )
    [28] => Array(
            [0] => tablecell_open
            [1] => Array(
                    [0] => 1
                    [1] => left
                )
            [2] => 99
        )
    [29] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>  Row 1 Col 3
                )
            [2] => 100
        )
    [30] => Array(
            [0] => cdata
            [1] => Array(
                    [0] =>         
                )
            [2] => 112
        )
    [31] => Array(
            [0] => tablecell_close
            [1] => Array()
            [2] => 120
        )
    [32] => Array(
            [0] => tablerow_close
            [1] => Array()
            [2] => 121
        )
    [33] => Array(
            [0] => table_close
            [1] => Array()
            [2] => 121
        )
    [34] => Array(
            [0] => p_open
            [1] => Array()
            [2] => 121
        )
    [35] => Array(
            [0] => cdata
            [1] => Array(
                    [0] => def

                )
            [2] => 122
        )
    [36] => Array(
            [0] => p_close
            [1] => Array()
            [2] => 122
        )
    [37] => Array(
            [0] => document_end
            [1] => Array()
            [2] => 122
        )
)

The top level array is simply a list. Each of its child elements describes a callback function to be executed against the Renderer (see description of the Renderer below) as well as the byte index in the raw input text where that particular “element” of wiki syntax was found.

A Single Instruction

Considering a single child element (which represents a single instruction) from the above list of instructions;

    [35] => Array(
            [0] => cdata
            [1] => Array(
                    [0] => def

                )
            [2] => 122
        )

The first element (index 0) is the name of a method or function in the Renderer to execute.

The second element (index 1) is itself an array, each of its elements being the arguments for the Renderer method that will be called.

In this case there is a single argument with the value "def\n", so the method call would be like;

$render->cdata("def\n");

The third element (index 2) is the byte index of the first character that “triggered” this instruction in the raw text document. It should be the same as the value returned by PHP's strpos() function. This can be used to retrieve sections of the raw wiki text, based on the positions of the instructions generated from it (example later).

Note: The Parser's parse method pads the raw wiki text with a preceding and proceeding linefeed character, to make sure particular Lexer states exit correctly, so you may need to subtract 1 from the byte index to get the correct location in the original raw wiki text. The Parser also normalizes linefeeds to Unix style (i.e. all \r\n becomes \n) so the document the Lexer sees may be smaller than the one you actually fed it.

An example of the instruction array of the syntax page can be found here

Renderer

The Renderer is a class which you define to create the output. The interface Doku_Renderer is defined in inc/parser/renderer.php and looks like;

<?php
class Doku_Renderer {
 
    // snip
 
    public function header($text, $level) {}
 
    public function section_open($level) {}
 
    public function section_close() {}
 
    public function cdata($text) {}
 
    public function p_open() {}
 
    public function p_close() {}
 
    public function linebreak() {}
 
    public function hr() {}
 
    // snip
}

It is used to document the Renderer although it could be also be extended if you wanted to write a Renderer which only captures certain calls.

The basic principle for how the instructions, returned from the parser, are used against a Renderer is similar to the notion of a SAX XML API - the instructions are a list of function / method names and their arguments. Looping through the list of instructions, each instruction can be called against the Renderer (i.e. the methods provided by the Renderer are callbacks). Unlike the SAX API, where only a few, fairly general, callbacks are available (e.g. tag_start, tag_end, cdata etc.), the Renderer defines a more explicit API, where the methods typically correspond one-to-one with the act of generating the output. In the section of the Renderer shown above, the p_open and p_close methods would be used to output the tags <p> and </p> in XHTML, respectively, while the header function takes two arguments - some text to display and the “level” of the header so a call like header('Some Title', 1) would be output in XHTML like <h1>Some Title</h1>.

Invoking the Renderer with Instructions

It is left up to the client code using the Parser to execute the list of instructions against a Renderer. Typically this will be done using PHP's call_user_func_array() function. For example;

// Get a list of instructions from the parser
$instructions = $parser->parse($rawDoc);
 
// Create a renderer
$renderer = new Doku_Renderer_xhtml();
 
// Loop through the instructions
foreach ($instructions as $instruction) {
    // Execute the callback against the Renderer
    call_user_func_array([$renderer, $instruction[0]], $instruction[1]);
}

Renderer Link Methods

The key Renderer methods for handling the different kinds of link are;

function camelcaselink($link) {} // $link like "SomePage"
- This can probably be ignored for spam checking - it shouldn't be possible for someone to link offsite with this syntax
function internallink($link, $title = null) {} // $link like "[[syntax]]"
- Although $link itself is internal, $title could be an image which is offsite, so needs checking
function externallink($link, $title = null) {}
- Both $link and $title (images) need checking
function interwikilink($link, $title = null, $wikiName, $wikiUri) {}
- The $title needs checking for images
function filelink($link, $title = null) {}
- Technically only valid file:// URLs should match but probably best to check anyway plus $title may be an offsite image
function windowssharelink($link, $title = null) {}
- Should only match valid Windows share URLs but check anyway plus $title for images
function emaillink($address, $title = null) {}
- $title could be an image. Check the email as well?
function internalmedialink($src, $title = null, $align = null, $width = null, $height = null, $cache = null) {}
- This shouldn't need check - should only link to local images. $title itself cannot be an image
function externalmedialink($src, $title = null, $align = null, $width = null, $height = null, $cache = null) {}
- $src needs checking

Special attention is required for methods which take the $title argument, which represents the visible text of the link, for example;

<a href="https://www.example.com">This is the title</a>

The $title argument can have three possible types of value;

null: no title was provided in the wiki document.
string: a plain text string was used as the title
array (hash): an image was used as the title.

If the $title is an array, it will containing associative values describing the image;

$title = [
    // Could be 'internalmedia' (local image) or 'externalmedia' (offsite image)
    'type' => 'internalmedia',
 
    // The URL to the image (may be a wiki URL or https://static.example.com/img.png)
    'src' => 'wiki:php-powered.png',
 
    // For the alt attribute - a string or null
    'title' => 'Powered by PHP',
 
    // 'left', 'right', 'center' or null
    'align' => 'right',
 
    // Width in pixels or null
    'width' => 50,
 
    // Height in pixels or null
    'height' => 75,
 
    // Whether to cache the image (for external images)
    'cache' => false,
];

Examples

The following examples show common tasks that would likely be performed with the parser, as well as raising performance considerations and notes on extending syntax.

Basic Invocation

To invoke the parser with all current modes, and parse the DokuWiki syntax document;

global $conf;
 
// Create the Handler
$handler = new Doku_Handler();
 
// Create the parser with the handler
$parser = new dokuwiki\Parsing\Parser($handler);
 
// Load all the modes
$parser->addMode('listblock', new dokuwiki\Parsing\ParserMode\ListBlock());
$parser->addMode('preformatted', new dokuwiki\Parsing\ParserMode\Preformatted());
$parser->addMode('notoc', new dokuwiki\Parsing\ParserMode\NoToc());
$parser->addMode('header', new dokuwiki\Parsing\ParserMode\Header());
$parser->addMode('table', new dokuwiki\Parsing\ParserMode\Table());
 
$formats = [
    'strong', 'emphasis', 'underline', 'monospace',
    'subscript', 'superscript', 'deleted',
];
foreach ($formats as $format) {
    $parser->addMode($format, new dokuwiki\Parsing\ParserMode\Formatting($format));
}
 
$parser->addMode('linebreak', new dokuwiki\Parsing\ParserMode\Linebreak());
$parser->addMode('footnote', new dokuwiki\Parsing\ParserMode\Footnote());
$parser->addMode('hr', new dokuwiki\Parsing\ParserMode\Hr());
 
$parser->addMode('unformatted', new dokuwiki\Parsing\ParserMode\Unformatted());
$parser->addMode('code', new dokuwiki\Parsing\ParserMode\Code());
$parser->addMode('file', new dokuwiki\Parsing\ParserMode\File());
$parser->addMode('quote', new dokuwiki\Parsing\ParserMode\Quote());
 
// These need data files. The get* functions are left to your imagination
$parser->addMode('acronym', new dokuwiki\Parsing\ParserMode\Acronym(array_keys(getAcronyms())));
// not used anymore, and unsure if getWordblocks() actually works here?
//$parser->addMode('wordblock',new dokuwiki\Parsing\ParserMode\Wordblock(getWordblocks()));
$parser->addMode('smiley', new dokuwiki\Parsing\ParserMode\Smiley(array_keys(getSmileys())));
$parser->addMode('entity', new dokuwiki\Parsing\ParserMode\Entity(array_keys(getEntities())));
 
$parser->addMode('multiplyentity', new dokuwiki\Parsing\ParserMode\Multiplyentity());
$parser->addMode('quotes', new dokuwiki\Parsing\ParserMode\Quotes());
 
$parser->addMode('camelcaselink', new dokuwiki\Parsing\ParserMode\CamelCaselink());
$parser->addMode('internallink', new dokuwiki\Parsing\ParserMode\Internallink());
$parser->addMode('media', new dokuwiki\Parsing\ParserMode\Media());
$parser->addMode('externallink', new dokuwiki\Parsing\ParserMode\Externallink());
$parser->addMode('emaillink', new dokuwiki\Parsing\ParserMode\Emaillink());
$parser->addMode('windowssharelink', new dokuwiki\Parsing\ParserMode\Windowssharelink());
$parser->addMode('filelink', new dokuwiki\Parsing\ParserMode\Filelink());
$parser->addMode('eol', new dokuwiki\Parsing\ParserMode\Eol());
 
// Loads the raw wiki document
$doc = file_get_contents($conf['datadir'] . 'wiki/syntax.txt');
 
// Get a list of instructions
$instructions = $parser->parse($doc);
 
// Create a renderer
$renderer = new Doku_Renderer_xhtml();
 
# Load data like smileys into the Renderer here

// Loop through the instructions
foreach ($instructions as $instruction) {
    // Execute the callback against the Renderer
    call_user_func_array([$renderer, $instruction[0]], $instruction[1]);
}
 
// Display the output
echo $renderer->doc;

Selecting Text (for sections)

The following shows how to select a range of text from the raw document using instructions from the parser;

global $conf;
 
// Create the Handler
$handler = new Doku_Handler();
 
// Create the parser with the handler
$parser = new dokuwiki\Parsing\Parser($handler);
 
// Load the header mode to find headers
$parser->addMode('header', new dokuwiki\Parsing\ParserMode\Header());
 
// Load the modes which could contain markup that might be
// mistaken for a header
$parser->addMode('listblock', new dokuwiki\Parsing\ParserMode\Listblock());
$parser->addMode('preformatted', new dokuwiki\Parsing\ParserMode\Preformatted());
$parser->addMode('table', new dokuwiki\Parsing\ParserMode\Table());
$parser->addMode('unformatted', new dokuwiki\Parsing\ParserMode\Unformatted());
$parser->addMode('code', new dokuwiki\Parsing\ParserMode\Code());
$parser->addMode('file', new dokuwiki\Parsing\ParserMode\File());
$parser->addMode('quote', new dokuwiki\Parsing\ParserMode\Quote());
$parser->addMode('footnote', new dokuwiki\Parsing\ParserMode\Footnote());
$parser->addMode('internallink', new dokuwiki\Parsing\ParserMode\Internallink());
$parser->addMode('media', new dokuwiki\Parsing\ParserMode\Media());
$parser->addMode('externallink', new dokuwiki\Parsing\ParserMode\Externallink());
$parser->addMode('email', new dokuwiki\Parsing\ParserMode\Emaillink());
$parser->addMode('windowssharelink', new dokuwiki\Parsing\ParserMode\Windowssharelink());
$parser->addMode('filelink', new dokuwiki\Parsing\ParserMode\Filelink());
 
// Loads the raw wiki document
$doc = file_get_contents($conf['datadir'] . 'wiki/syntax.txt');
 
// Get a list of instructions
$instructions = $parser->parse($doc);
 
// Use this to watch when we're inside the section we want
$inSection = false;
$startPos = 0;
$endPos = 0;
 
// Loop through the instructions
foreach ($instructions as $instruction) {
    if (!$inSection) {
        // Look for the header for the "Lists" heading
        if ($instruction[0] == 'header' &&
            trim($instruction[1][0]) == 'Lists') {
 
            $startPos = $instruction[2];
            $inSection = true;
        }
    } else {
        // Look for the end of the section
        if ($instruction[0] == 'section_close') {
            $endPos = $instruction[2];
            break;
        }
    }
}
 
// Normalize and pad the document in the same way the parse does
// so that byte indexes with match
$doc = "\n" . str_replace("\r\n", "\n", $doc) . "\n";
 
// Get the text before the section we want
$before = substr($doc, 0, $startPos);
$section = substr($doc, $startPos, ($endPos - $startPos));
$after = substr($doc, $endPos);

Managing Data File Input for Patterns

DokuWiki stores parts of some patterns in external data files (e.g. the smileys). Because the parsing and output of the document are now separate stages, handled by different components, a different approach is required for using this data, compared to earlier parser versions.

For the relevant modes, each accepts a plain list of elements which it builds into a list of patterns for registering with the Lexer.

For example;

// A plain list of smiley tokens...
$smileys = [
    ':-)',
    ':-(',
    ';-)',
    // etc.
];
 
// Create the mode
$smileyMode = new dokuwiki\Parsing\ParserMode\Smiley($smileys);
 
// Add it to the parser
$parser->addMode($smileyMode);

The parser is not interested in the output format for the smileys.

The other modes this applies to are defined by the classes;

dokuwiki\Parsing\ParserMode\Acronym - for acronyms
dokuwiki\Parsing\ParserMode\Wordblock - to block specific words (e.g. bad language)
dokuwiki\Parsing\ParserMode\Entity - for typography

Each accepts a list of “interesting strings” to its constructor, in the same way as the smileys.

In practice it is probably worth defining functions for retrieval of the data from the configuration files and storing the associative arrays in a static value e.g.;

function getSmileys() {
    static $smileys = null;
 
    if (!$smileys) {
        $smileys = array();
        $lines = file( DOKU_CONF . 'smileys.conf');
 
        foreach($lines as $line){
            //ignore comments
            $line = preg_replace('/#.*$/','',$line);
            $line = trim($line);
 
            if(empty($line)) continue;
 
            $smiley = preg_split('/\s+/',$line,2);
 
            // Build the associative array
            $smileys[$smiley[0]] = $smiley[1];
        }
    }
    return $smileys;
}

This function can now be used like;

// Load the smiley patterns into the mode
$smileyMode = new dokuwiki\Parsing\ParserMode\Smiley(array_keys(getSmileys()));

// Load the associate array in a renderer for lookup on output
$renderer->smileys = getSmileys();

Note: Checking for links which should be blocked is handled in a separate manner, as described below.

Testing Links for Spam

Ideally we want to be able to check for links to spam before storing a document (after editing).

This example should be viewed with caution. It makes a useful point of reference but having actually tested it since, it's very slow - probably easier just to use a simple function that is “syntax blind” but searches the entire document for links which match the blacklist. Meanwhile this example could be useful as a basis for building a 'wiki map' or finding 'wanted pages' by examining internal links. Probably best run as a cron job

This could be done by building a special Renderer that examines only the link-related callbacks and checks the URL against a blacklist.

A function is needed to load the spam.conf and bundle it into a single regex;

Recently tested this approach (single regex) against the latest blacklist from http://blacklist.chongqed.org/ and got errors about the final regex being too big. This should probably split the regex into smaller pieces and return them as an array

function getSpamPattern() {
    static $spamPattern = NULL;
 
    if (is_null($spamPattern)) {
        $lines = @file(DOKU_CONF . 'spam.conf');
        if (!$lines) {
            $spamPattern = '';
        } else {
            $spamPattern = '#';
            $sep = '';
            foreach ($lines as $line) {
                // Strip comments
                $line = preg_replace('/#.*$/', '', $line);
 
                // Ignore blank lines
                $line = trim($line);
                if (empty($line)) continue;
 
                $spamPattern .= $sep . $line;
                $sep = '|';
            }
            $spamPattern .= '#si';
        }
    }
    return $spamPattern;
}

Now we need to extend the base Renderer with one that will examine links only;

class Doku_Renderer_SpamCheck extends Doku_Renderer {
 
    // This should be populated by the code executing the instructions
    protected $currentCall;
 
    // An array of instructions that contain spam
    protected $spamFound = [];
 
    // pcre pattern for finding spam
    protected $spamPattern = '#^$#';
 
    public function internallink($link, $title = null) {
        $this->checkTitle($title);
    }
 
    public function externallink($link, $title = null) {
        $this->checkLinkForSpam($link);
        $this->checkTitle($title);
    }
 
    public function interwikilink($link, $title = null) {
        $this->checkTitle($title);
    }
 
    public function filelink($link, $title = null) {
        $this->checkLinkForSpam($link);
        $this->checkTitle($title);
    }
 
    public function windowssharelink($link, $title = null) {
        $this->checkLinkForSpam($link);
        $this->checkTitle($title);
    }
 
    public function emaillink($address, $title = null) {
        $this->checkLinkForSpam($address);
        $this->checkTitle($title);
    }
 
    public function internalmedialink($src) {
        $this->checkLinkForSpam($src);
    }
 
    public function externalmedialink($src) {
        $this->checkLinkForSpam($src);
    }
 
    protected function checkTitle($title) {
        if (is_array($title) && isset($title['src'])) {
            $this->checkLinkForSpam($title['src']);
        }
    }
 
    // Pattern matching happens here
    protected function checkLinkForSpam($link) {
        if (preg_match($this->spamPattern, $link)) {
            $spam = $this->currentCall;
            $spam[3] = $link;
            $this->spamFound[] = $spam;
        }
    }
 
    public getSpamFound() {
        return $this->spamFound;
    }
}

Note the line $spam[3] = $link; in the checkLinkForSpam() method. This adds an additional element to the list of spam instructions found, making it easy to determine what the bad URLs were (e.g. for logging).

Finally we can use this spam checking renderer like;

global $conf;
 
// Create the Handler
$handler = new Doku_Handler();
 
// Create the parser with the handler
$parser = new dokuwiki\Parsing\Parser($handler);
 
// Load the modes which could contain markup that might be
// mistaken for a link
$parser->addMode('preformatted', new dokuwiki\Parsing\ParserMode\Preformatted());
$parser->addMode('unformatted', new dokuwiki\Parsing\ParserMode\Unformatted());
$parser->addMode('code', new dokuwiki\Parsing\ParserMode\Code());
$parser->addMode('file', new dokuwiki\Parsing\ParserMode\File());
$parser->addMode('quote', new dokuwiki\Parsing\ParserMode\Quote());
 
// Load the link modes...
$parser->addMode('internallink', new dokuwiki\Parsing\ParserMode\Internallink());
$parser->addMode('media', new dokuwiki\Parsing\ParserMode\Media());
$parser->addMode('externallink', new dokuwiki\Parsing\ParserMode\Externallink());
$parser->addMode('email', new dokuwiki\Parsing\ParserMode\Emaillink());
$parser->addMode('windowssharelink', new dokuwiki\Parsing\ParserMode\Windowssharelink());
$parser->addMode('filelink', new dokuwiki\Parsing\ParserMode\Filelink());
 
// Loads the raw wiki document
$doc = file_get_contents($conf['datadir'] . 'wiki/spam.txt');
 
// Get a list of instructions
$instructions = $parser->parse($doc);
 
// Create a renderer
$renderer = new Doku_Renderer_SpamCheck();
 
// Load the spam regex
$renderer->spamPattern = getSpamPattern();
 
// Loop through the instructions
foreach ($instructions as $instruction) {
    // Store the current instruction
    $renderer->currentCall = $instruction;
    call_user_func_array([$renderer, $instruction[0]], $instruction[1]);
}
 
// What spam did we find?
echo '<pre>';
print_r($renderer->getSpamFound());
echo '</pre>';

Because we don't need all the syntax modes, checking for spam in this manner should be faster than normal parsing of a document.

Adding Substitution Syntax

Warning: the code below hasn't been tested - just an example

As a simpler task in modifying the parser, this example will add a “bookmark” tag, which can be used to create a named anchor in a document for linking in.

The syntax for the tag will be like;

BM{My Bookmark}

The string “My Bookmark” is the name of the bookmark while the rest identifies it as being a bookmark. In HTML this would correspond to;

<a name="My Bookmark"></a>

Adding this syntax requires the following steps;

Create a parser syntax mode to register with the Lexer
Update the $PARSER_MODES['substition'] subarray found at the end of inc/parser/parser.php, which is used to deliver a quick list of modes (used in classes like dokuwiki\Parsing\ParserMode\Table
Update the Handler with a method to catch bookmark tokens
Update the abstract Renderer as documentation and any concrete Renderer implementations that need it.

Creating the parser mode means extending the dokuwiki\Parsing\ParserMode\AbstractMode class and overriding its connectTo method;

namespace dokuwiki\Parsing\ParserMode;
 
class Bookmark extends AbstractMode {
    public function connectTo($mode) {
        // Allow word and space characters
        $this->Lexer->addSpecialPattern('BM\{[\w ]+\}', $mode, 'bookmark');
    }
}

This will match the complete bookmark using a single pattern (extracting the bookmark name from the rest of the syntax will be left to the Handler). It uses the Lexer addSpecialPattern() method so that the bookmark lives in its own state.

Note the Lexer does not require the start / end pattern delimiters - it takes care of this for you.

Because nothing inside the bookmark should be considered valid wiki markup, there is no reference here to other modes which this mode might accept.

Next the subarray substitution in the $PARSER_MODES array in the inc/parser/parser.php file needs updating so that the new mode called bookmark is returned in the list;

global $PARSER_MODES;
$PARSER_MODES = [
    ...
 
    // modes where the token is simply replaced - they can not contain any
    // other modes
    'substition' => [
        'acronym', 'smiley', 'wordblock', 'entity', 'camelcaselink', 
        'internallink', 'media', 'externallink', 'linebreak', 'emaillink', 
        'windowssharelink', 'filelink', 'notoc', 'nocache', 'multiplyentity', 
        'quotes', 'rss', 'bookmark'
    ],
 
    ...
];

This subarray is just to help registering these modes with other modes that accept them (e.g., lists can contain these modes - you can have a link inside a list) without having to list them in full each time they are needed.

Note: Similar subarrays exist, like protected and formatting which return different groups of modes. The grouping of different types of syntax is not entirely perfect but still useful to save lines of code.

With the syntax now described, a new method, which matches the name of the mode (i.e. bookmark) needs to be added to the Handler;

class Doku_Handler {
    // ...
 
    // $match is the string which matched the Lexers regex for bookmarks
    // $state identifies the type of match (see the Lexer notes above)
    // $pos is the byte index in the raw doc of the first character of the match
    public function bookmark($match, $state, $pos) {
 
        // Technically don’t need to worry about the state;
        // should always be DOKU_LEXER_SPECIAL or there's
        // a very serious bug
        switch ($state) {
            case DOKU_LEXER_SPECIAL:
                // Attempt to extract the bookmark name
                if (preg_match('/^BM\{(\w+)}$/', $match, $nameMatch)) {
                    $name = $nameMatch[1];
 
                    // arg0: name of the Renderer method to call
                    // arg1: array of arguments to the Renderer method
                    // arg2: the byte index as before
                    $this->addCall('bookmark', [$name], $pos);
 
                    // If the bookmark didn't have a valid name, simply pass it
                    // through unmodified as plain text (cdata)
                } else {
                    $this->addCall('cdata', [$match], $pos);
                }
                break;
        }
 
        // Must return true or the lexer will halt
        return true;
    }
 
    // ...
}

The final step is updating the Renderer (renderer.php) with a new function and implementing it in the XHTML Renderer (xhtml.php);

class Doku_Renderer {
    // ...
 
    public function bookmark($name) {}
 
    // ...
}

class Doku_Renderer_xhtml {
    // ...
 
    public function bookmark($name) {
        $name = $this->_xmlEntities($name);
 
        // id is required in XHTML while name still supported in 1.0
        echo '<a class="bookmark" name="' . $name . '" id="' . $name . '"></a>';
    }
 
    // ...
}

See the _test/tests/inc/parser/parser_replacements.test.php script for examples of how you might test this code.

Adding Formatting Syntax (with state)

Warning: the code below hasn't been tested - just an example

To show more advanced use of the Lexer, this example will add markup that allows users to change the enclosed text color to red, yellow or green.

The markup would look like;

<red>This is red</red>.
This is black.
<yellow>This is yellow</yellow>.
This is also black.
<green>This is yellow</green>.

The steps required to implement this are essentially the same as the previous example, stating with the new syntax mode, but add some additional detail as other modes are involved;

namespace dokuwiki\Parsing\ParserMode;
 
class TextColors extends AbstractMode {
 
    protected $color;
    protected $colors = ['red', 'green', 'blue'];
 
    public function __construct($color) {
        global $PARSER_MODES;
 
        // Just to help prevent mistakes using this mode
        if (!array_key_exists($color, $this->colors)) {
            trigger_error('Invalid color ' . $color, E_USER_WARNING);
        }
        $this->color = $color;
 
        // This mode accepts other modes;
        $this->allowedModes = array_merge(
            $PARSER_MODES['formatting'],
            $PARSER_MODES['substition'],
            $PARSER_MODES['disabled']
        );
 
        unset($this->allowedModes[array_search($color, $this->allowedModes)]);
    }
 
    // connectTo is called once for every mode registered with the Lexer
    public function connectTo($mode) {
        // The lookahead pattern makes sure there's a closing tag...
        $pattern = '<' . $this->color . '>(?=.*</' . $this->color . '>)';
 
        // arg0: pattern to match to enter this mode
        // arg1: other modes where this pattern may match
        // arg2: name of the this mode
        $this->Lexer->addEntryPattern($pattern, $mode, $this->color);
    }
 
    // post connect is only called once
    public function postConnect() {
        // arg0: pattern to match to exit this mode
        // arg1: name of mode to exit
        $this->Lexer->addExitPattern('</' . $this->color . '>', $this->color);
    }
 
    // if pattern belongs to two or more modes, the one with the lowest sort number wins
    public function getSort() {
        return 158;
    }
}

Some points on the above class.

It actually represents multiple modes, one for each color. The colors need separating into different modes so that </green> doesn't end up being the closing tag for <red>, for example.
These modes can contain other modes, for example <red>**Warning**</red> for bold text which is red. This is registered in the constructor for this class by assigning the accepted mode names to the allowedModes property.
the constructor unset one of the modes to prevent a formatting mode being nested inside itself (e.g. we don't want <red>A <red>warning</red> message</red> to happen).
When registering the entry pattern, it's a good idea to check the exit pattern exists (which is done with the lookahead). This should help protect users from themselves, when they forget to add the closing tag.
The entry pattern needs to be registered for each mode within which the color tags could be used. By contrast we only need one exit pattern, so this is placed in the postConnect method, so that is only executed once, after all calls to connectTo on all modes have been called.

With the parsing mode class done, the new modes now need adding to the $PARSER_MODES['formatting'] subarray in inc/parser/parser.php

global $PARSER_MODES;
$PARSER_MODES = [
    ...
    // modes for styling text -- footnote behaves similar to styling
    'formatting' => [
        'strong', 'emphasis', 'underline', 'monospace', 'subscript', 
        'superscript', 'deleted', 'footnote',
        'red','yellow','green'
    ],
    ...
];

Next the Handler needs updating with one method for each color;

class Doku_Handler {
    // ...
 
    public function red($match, $state, $pos) {
        // The nestingTag method in the Handler is there
        // to save having to repeat the same code many
        // times. It will create an opening and closing
        // instruction for the entry and exit patterns,
        // while passing through the rest as cdata
        $this->nestingTag($match, $state, $pos, 'red');
        return true;
    }
 
    public function yellow($match, $state, $pos) {
        $this->nestingTag($match, $state, $pos, 'yellow');
        return true;
    }
 
    public function green($match, $state, $pos) {
        $this->nestingTag($match, $state, $pos, 'green');
        return true;
    }
 
    // ...
}

Finally we can update the Renderers;

class Doku_Renderer {
    // ...
 
    public function red_open() {}
    public function red_close() {}
 
    public function yellow_open() {}
    public function yellow_close() {}
 
    public function green_open() {}
    public function green_close() {}
 
    // ...
}

class Doku_Renderer_xhtml {
    // ...
 
    public function red_open() {
        echo '<span class="red">';
    }
    public function red_close() {
        echo '</span>';
    }
 
    public function yellow_open() {
        echo '<span class="yellow">';
    }
    public function yellow_close() {
        echo '</span>';
    }
 
    public function green_open() {
        echo '<span class="green">';
    }
    public function green_close() {
        echo '</span>';
    }
 
    // ...
}

See the _test/tests/inc/parser/parser_i18n.test.php script for examples of how you might write unit tests for this code.

Adding Block-Level Syntax

Warning: the code below hasn't been tested - just an example

Extending the previous example, this one will create a new tag for marking up messages in the document as things still to be done. Example use might look like;

===== Wiki Quotation Syntax =====

This syntax allows

<todo>
Describe quotation syntax '>'
</todo>

Some more text

This syntax might allow a tool to be added to search wiki pages and find things that still need something doing, as well as making it stand out in the document with some eye-catching style.

What's different about this syntax is it should be displayed in a separate block in the document (e.g. inside <div/> so that it can be floated with CSS). This requires modifying the dokuwiki\Parsing\Handler\Block class, which loops through all the instructions after all tokens have been seen by the handler and takes care of adding <p/> tags.

The parser mode for this syntax might be;

namespace dokuwiki\Parsing\ParserMode;
 
class Todo extends AbstractMode {
 
    public function __construct() {
        $this->allowedModes = array_merge (
            $PARSER_MODES['formatting'],
            $PARSER_MODES['substition'],
            $PARSER_MODES['disabled']
        );
    }
 
    public function connectTo($mode) {
        $pattern = '<todo>(?=.*</todo>)';
        $this->Lexer->addEntryPattern($pattern, $mode, 'todo');
    }
 
    public function postConnect() {
        $this->Lexer->addExitPattern('</todo>','todo');
    }
 
    public function getSort() {
        return 150;
    }
 
}

This mode is then added to the container entry of $PARSER_MODES in inc/parser/parser.php;

global $PARSER_MODES;
$PARSER_MODES = [
    // containers are complex modes that can contain many other modes
    // hr breaks the principle but they shouldn't be used in tables / lists
    // so they are put here
    'container' => ['listblock', 'table', 'quote', 'hr', 'todo'],
 
    ...
];

Updating the Doku_Handler class simply requires;

class Doku_Handler {
 
    // ...
 
    public function todo($match, $state, $pos) {
        $this->nestingTag($match, $state, $pos, 'todo');
        return true;
    }
 
    // ...
 
}

But the dokuwiki\Parsing\Handler\Block class also needs updating, to register the todo opening and closing instructions;

namespace dokuwiki\Parsing\Handler;
 
class Block {
    // ...
 
    // Blocks don't contain linefeeds
    protected $blockOpen = [
        'header',
        'listu_open','listu_open', 'listo_open', 'listitem_open', 
        'listcontent_open', 'table_open', 'tablerow_open', 'tablecell_open', 
        'tableheader_open', 'tablethead_open', 'quote_open', 'code', 'file', 
        'hr', 'preformatted', 'rss', 'footnote_open'
        'todo_open'
    ];
 
    protected $blockClose = [
        'header', 'listu_close', 'listo_close', 'listitem_close', 
        'listcontent_close', 'table_close', 'tablerow_close', 'tablecell_close', 
        'tableheader_close', 'tablethead_close', 'quote_close', 'code', 'file',
        'hr', 'preformatted', 'rss', 'footnote_close'
        'todo_close'
    ];

By registering the todo_open and todo_close in the $blockOpen and $blockClose arrays, it instructs the dokuwiki\Parsing\Handler\Block class that any previous open paragraphs should be closed before entering the todo section then a new paragraph should start after the todo section. Inside the todo, no additional paragraphs should be inserted.

With that done, the Renderers can be updated;

class Doku_Renderer {
 
    // ...
 
    public function todo_open() {}
    public function todo_close() {}
 
    // ...
 
}

class Doku_Renderer_xhtml {
 
    // ...
 
    public function todo_open() {
        echo '<div class="todo">';
    }
    public function todo_close() {
        echo '</div>';
    }
 
    // ...
 
}

Serializing the Renderer Instructions

Dokuwiki uses a caching mechanism inherited from the dokuwiki\Cache\Cache. Rewrite this section.

It is possible to serialize the list of instructions output from the Handler, to eliminate the overhead of re-parsing the original document on each request, if the document itself hasn't changed.

A simple implementation of this might be;

global $conf;
 
$filename = $conf['datadir'] . 'wiki/syntax.txt';
$cacheId = $conf['cachedir'] . $filename . '.cache';
 
// If there's no cache file or it's out of date
// (the original modified), get a fresh list of instructions
if (!file_exists($cacheId) || (filemtime($filename) > filemtime($cacheId))) {
    // Create the Handler
    $handler = new Doku_Handler();
 
    // Create the parser with the handler
    $parser = new dokuwiki\Parsing\Parser($handler);
 
    // Load all the modes
    $parser->addMode('listblock', new dokuwiki\Parsing\ParserMode\ListBlock());
    $parser->addMode('preformatted', new dokuwiki\Parsing\ParserMode\Preformatted());
    $parser->addMode('notoc', new dokuwiki\Parsing\ParserMode\NoToc());
    $parser->addMode('header', new dokuwiki\Parsing\ParserMode\Header());
    $parser->addMode('table', new dokuwiki\Parsing\ParserMode\Table());
 
    // etc. etc.
 
    $instructions = $parser->parse(file_get_contents($filename));
 
    // Serialize and cache 
    $sInstructions = serialize($instructions);
 
    if ($fh = @fopen($cacheId, 'a')) {
        if (fwrite($fh, $sInstructions) === false) {
            die("Cannot write to file ($cacheId)");
        }
 
        fclose($fh);
    }
} else {
    // Load the serialized instructions and unserialize
    $sInstructions = file_get_contents($cacheId);
    $instructions = unserialize($sInstructions);
}
 
$renderer = new Doku_Renderer_xhtml();
 
foreach ($instructions as $instruction) {
    call_user_func_array(
        array($renderer, $instruction[0]), $instruction[1]
    );
}
 
echo $renderer->doc;

Note this implementation is not complete. What happens if someone modifies one of the smiley.conf files to add a new smiley, for example? The change will need to trigger updating the cache, so that the new smiley is parsed. Some care over file locking (or the renaming trick) may also be also be required.

Serializing the Parser

connecting modes is protected and already called in parse(), but that is outside the cache here. Probably not useful anymore. If still usefull, rewrite with dokuwiki\Cache\Cache

Similar to the above example, it is also possible to serialize the Parser itself, before parsing begins. Because setting up the modes carries a fairly high overhead, this can add a small increase in performance. From loose benchmarking, parsing the wiki:syntax page on a single (slow!) system, what taking around 1.5 seconds to finish without serializing the Parser and about 1.25 seconds with the a serialized version of the Parser.

In brief it can be implemented something like;

global $conf;
 
$cacheId = $conf['cachedir'] . 'parser.cache';
if (!file_exists($cacheId)) {
    // Create the parser with the handler
    $handler = new Doku_Handler();
    $parser = new dokuwiki\Parsing\Parser($handler);
 
    // Load all the modes
    $parser->addMode('listblock',new dokuwiki\Parsing\ParserMode\ListBlock());
    $parser->addMode('preformatted',new dokuwiki\Parsing\ParserMode\Preformatted());
    # etc.

    // Serialize
    $sParser = serialize($parser);
 
    // Write to file
    if ($fh = @fopen($cacheId, 'a')) {
        if (fwrite($fh, $sParser) === false) {
            die("Cannot write to file ($cacheId)");
        }
        fclose($fh);
    }
 
} else {
    // Otherwise load the serialized version
    $sParser = file_get_contents($cacheId);
    $parser = unserialize($sParser);
}
$doc = ...
$parser->parse($doc);

Some implementation notes which aren't covered above;

Should use some file locking when writing to the cache (or else create with different name then rename) otherwise a request may receive a partially complete cache file, if read while writing still in progress
What to do if one of the *.conf files is updated? Need to flush the cache.
May be different versions of the Parser (e.g. for spam checking) so use different cache IDs

Testing

See unittesting for setup and details.

For the DokuWiki parser, tests have been provided for all the syntax implemented and I strongly recommend writing new tests if additional syntax is added.

Some notes / recommendations;

Re-run the tests every time you change something in the parser - problems will often surface immediately saving lots of time.
They only test specific cases. They don't guarantee there's no bugs only that those specific cases are working properly.
If bugs are found, write a test for that bug while fixing it (better yet, before fixing it), to prevent it recurring.

Bugs / Issues

Some things off the top of my head. move to bug tracker?

Order of adding modes important

Haven't entirely nailed down the “rules” on this one but the order in which modes are added is important (and the Parser doesn't check this for you). In particular, the eol mode should be loaded last, as it eats linefeed characters that may prevent other modes like lists and tables from working properly.

In general recommend loading the modes in the order used in the first example here.

From what I have worked out, order is only important if two or more modes have patterns which can be matched by the same set of characters - in which case the mode with the lowest sort number will win out. A syntax plugin can make use of this to replace a native DokuWiki handler, for an example see code plugin — ChrisS 2005-07-30

Change to Wordblock

Originally the wordblock functionality was for match link URLs against a blacklist. This has been changed. The “wordblock” mode is used for matching things like rude words, fuck it. For prevent spam URLs, probably best to use the example above.

One recommendation here - the conf/wordblock.conf file should be renamed conf/spam.conf, containing the URL blacklist. A new file conf/badwords.conf contains a list of rude words to censor.

Weakest Links

From the point of view of design, the worst parts of the code are in inc/parser/handler.php, namely the “re-writing” classes;

Doku_Handler_List (inline re-writer)
Doku_Handler_Preformatted (inline re-writer)
Doku_Handler_Quote (inline re-writer)
Doku_Handler_Table (inline re-writer)
Doku_Handler_Section (post processing re-writer)
Doku_Handler_Block (post processing re-writer)
Doku_Handler_Toc (post processing re-writer)

The “inline re-writers” are used while the Handler is still receiving tokens from the Lexer while the “post processing re-writers” are invoked from Doku_Handler::__finalize() and loop once through the complete list of instructions the Handler has created (which has a performance overhead).

It may be possible to eliminate Doku_Handler_List, Doku_Handler_Quote and Doku_Handler_Table by using multiple lexing modes (each of these currently uses only a single mode).

Also it may be possible to change Doku_Handler_Section and Doku_Handler_Toc to being “inline re-writers”, triggered by header tokens received by the Handler.

The most painful is the Doku_Handler_Block class, responsible for inserting paragraphs into the instructions. There may be a value in inserting further abstractions to make it easier to maintain but, in general, can't see a way to eliminate it completely and there's probably some bugs there which have yet to be found.

Greedy Tags

Consider the following wiki syntax;

Hello <sup>World 

----

<sup>Goodbye</sup> World

The user forgot to close the first <sup> tag.

The result is;

Hello ^{World

—-

<sup>Goodbye} World

The first <sup> tag is being too greedy in checking for its entry pattern.

This applies to all similar modes. The entry patterns currently check for that the closing tag exists somewhere but should also check that a second opening tag of the same sort was not found first.

Footnote across list

There's one failing test in the test suite to document this problem. In essence, if a footnote is closed across multiple list items, it can have the effect of producing an opening footnote instruction without the corresponding closing instruction. The following is an example of syntax that would cause this problem;

  *((A))
    *(( B
  * C ))

For the time being users will have to fix pages where this has been done. The solution is to split list tokenization into multiple modes (currently there is only a single mode listblock for lists).

Linefeed grabbing

261

NOTE: See this GitHub issue for further details and a possible workaround.

Because the header, horizontal rule, list, table, quote and preformatted (indented text) syntax relies on linefeed characters to mark their starts and ends, they require regexes which consume linefeed characters. This means users need to add an additional linefeed if a table appears immediately after a list, for example.

Given the following wiki syntax;

Before the list
  - List Item
  - List Item
| Cell A | Cell B |
| Cell C | Cell D |
After the table

It produces;

Before the list

List Item
List Item

| Cell A | Cell B |

Cell C

Cell D

After the table

Notice that the first row of the table is treated as plain text.

To correct this the wiki syntax must have an additional linefeed between the list and the table (which could also contain text);

Before the list
  - List Item
  - List Item

| Cell A | Cell B |
| Cell C | Cell D |
After the table

Which looks like;

Before the list

List Item
List Item

Cell A	Cell B
Cell C	Cell D

After the table

Without scanning the text multiple times (some kind of “pre-parse” operation which inserts linefeeds), can't see any easy solutions here.

Lists / Tables / Quote Issue

For list, table and quote syntax, there is a possibility of child syntax eating multiple “lines”. For example a table like;

| Cell A | <sup>Cell B |
| Cell C | Cell D</sup> |
| Cell E | Cell F |

Produces;

Cell A	^{Cell B \| \| Cell C \| Cell D}
Cell E	Cell F

Ideally this should be rendered like;

Cell A	<sup>Cell B
Cell C	Cell D</sup>
Cell E	Cell F

i.e. the opening <sup> tag should be ignored because it has no valid closing tag.

Fixing this will requiring using multiple modes inside tables, lists and quotes.

Footnotes and blocks

Inside footnotes paragraph blocks are ignored and the equivalent of a <br/> instruction is used instead, to replace linefeeds. This is basically a result of the Doku_Handler_Block being awkward to maintain. Further to this, if a table, list, quote or horizontal rule is used inside a footnote, it will trigger a paragraph.

This should be fixed by modifying Doku_Handler_Block but recommend an overhaul of the design before doing so.

Headers

Currently headers can reside on the same line as other preceding text. This is a knock on effect from the “Linefeed grabbing” issue described above and would require some kind of “pre parse” to fix it. For example;

Before the header
Some text == Header ==
After the header

If the behaviour is to be the same as the original DokuWiki parser, this should really be interpreted as;

Before the header Some text == Header == After the header

But in fact will result in;

Before the header Some text

After the header

Block / List Issue

There is a problem if, before a list there is a blank line with two spaces, the whole including the list will be interpreted as a block:

* list item
* list item 2

TODO

Some things that probably need doing.

More State to State Closing Instructions

May be useful, for rendering other formats than XHTML, to add things like the indentation level to closing list instructions, etc.

why not just “render” to XML, and than apply some xslt/xml parsers on it?

Table / List / Quote sub modes

Lexer with multiple modes to prevent the issues with nesting states.

Discussion

Enhance the Parser with Tree Structure

The parser is quite simple, because it is only a RegEx and flat list based parser. This makes the parser weak against code errors, and difficult to create correct xhtml specially in case of nested codes. To enhance the parser, it should generate a tree structure instead of a simple list. This allows too correct errors in text code, generate correct transitional xhtml code and maybe (only maybe…) save time or memory. Many issues with errors could by corrected too (see above, and DokuWiki has problems with big tables). As example, this makes it possible to send a “p_open” “p_close” blind to the renderer, and the renderer only generates code if there is not already a open or close P tag, and can close tags if they are forgotten, or delete unneeded empty open P tags, or close them before tags like TABLE or H2… (Or better use “New Paragraph Node” instead of P). Even a syntax check and corrections could be possible. a class based tree code you find here, if the idea is interesting for the developers of DokuWiki (but maybe there are even better codes): Tree. MediaWiki use a tree parser, but with Tree it is possible to create a simpler easier way to realize this. It is easy to pack this simple code in one php file and implement this inside the inc. For a parser it needs a “nested, from inside out” search algorithm, but this is quick done. And tree could enhance even the form.php code, and plugins could use it too.

The parsed structure then could look like this, and with this tree it is easy to insert the xhtml tags (Open Close Tags) in a code generator:

TAG-H1
  CDATA: text
  NEWLINE, AMO=2
  CDATA: text
  CDATA: text
  TAG-B
    CDATA: text
  CDATA: text

  LIST, TYPE=1
    CDATA: text
    LIST
      CDATA: text
    CDATA: text
  LIST
    CDATA: text

  TABLE
    TABLEROW
      TABLECELL, BIND=2
        CDATA: text
      TABLECELL
        CDATA: text

  CDATA: text

But, it would need a heavy and work intensive redesign of the parser. A tree parser is much stronger but more difficult to realize…… And the handling of plugins could be dificult too…

Optimization of a List-Parsers

Optimization of a list-parsers (instead of using a tree-parser).
A list-parser can be optimized with a block-stack witch allows proofing of already open or close blocks. This allows the parser more intelligence and error corrections, even if it is less elegant then a tree-parser it allows to produce more accurate html-code.
For example:
You hat open a few blocks an the stack then lock like this “div;p;b”.
Now you close a “i” tag witch not exist in the stack, then you simply not but a close-tag </i>.
Or you have a stack like this “div;p” an now you send a “p-open”, then you do nothing. (Or you close the “p” first and open a other. Not recommended.)
Or you have a stack like this “div;p;b;i” an now you send a “b-close”, then “i” has to be closed first. Or if you have “div;p;b;i” and you send a “p-close”, “i” and “b” would be closed first.
Or you have a stack like this “div;p” and you send a “table-open”, because “table” can not exist inside “p”, you close “p” first, then you open “table”, and then if a “table-close” is send, you remember that “p” was closed automatically so you reopen “p” again.
And at the finalization of the parser, all still open tags can be closed.
If you insert even Content information like “div;CONTENT;p;b” and you send a “b-close” or “p-close” you can delete “b” or “p, b” instead of creating a empty “<p><b></b></p>”.
The stack allows too to jump directly to the start-position of a tag. Useful for example if you want place the section edit button directly after the header, and therefore jump to the start-tag wen setting the close-tag and insert the code on the start-position.
(for developers, logging of tag-corrections can help by debugging).

This allows a plugin-developter for example to send a “p-close” blind to the parser to make sure it is close, put the parser only generate it if necessary. (Some format-plugins have issues with not open/close or double open/close tags).
And it allows the plugin too to see on the stack witch tags are already open and can react accordingly or get informations of the start-tag.
It would be anyway better if plugindevelopers don't generate there one tag-code but use the parser instead and take so advantage of the stack structure of it. (Maybe the parser then has to allow more access to parameters or allow a $more parameter to extend the tag whit additional parameters(?).) And a free customizable “tag-open” “tag-close” (“tag-content”?) with allows to define the tag-name it self could be useful too.
(Principally even plugins could use for there blocks the stack, like “div;p;PLUGIN1”, or “h1;SECEDIT;div;p”.)

Even if the parser not understand and corrects all the html-rules, the parser wold be more stable and produces more accurate code. It only have to correct the “p” “table” and simple “b i u …” behavior correctly and it would be a useful improvement.

¹⁾

Lexer refers to the class dokuwiki\Lexer\Lexer and the contents of the file inc/Parsing/Lexer/Lexer.php

²⁾

scan: reading a PHP string from start to end

³⁾

the term token in this document refers to a regex match, made by the Lexer, and the corresponding method call on the Handler

⁴⁾

Handler refers to the class Doku_Handler and the contents of the file inc/parser/handler.php

⁵⁾

the sequence of instructions is stored in an array called $calls, which is a property of the Handler. It is intended for use with call_user_func_array

⁶⁾

Parser refers to the class dokuwiki\Parsing\Parser and the contents of the file inc/Parsing/Parser.php

⁷⁾

Render refers to some class implementing Doku_Renderer - see inc/parser/renderer.php and inc/parser/xhtml.php

⁸⁾

The terms “state” and “mode” are used somewhat interchangeably when talking about the Lexer here

⁹⁾

The notion of being “badly formed” is not applicable to the DokuWiki parser - it is designed to prevent issues where a user forgets to add the closing tag on some markup by ignoring the markup completely

¹⁰⁾

simply calling it list results in a PHP parse error because list is a PHP keyword - so the parser has to use listblock

Table of Contents

The DokuWiki Parser

Overview

Lexer

The need for state

Lexer Modes

The Lexer API

Constructor

addEntryPattern / addExitPattern

addPattern

addSpecialPattern

mapHandler

Subpatterns Not Allowed

Syntax Errors and State

Handler

Handler Token Methods

Token Conversion

Parser

Instructions Data Format

A Single Instruction

Renderer

Invoking the Renderer with Instructions

Renderer Link Methods

Examples

Basic Invocation

Selecting Text (for sections)

Managing Data File Input for Patterns

Testing Links for Spam

Adding Substitution Syntax

Adding Formatting Syntax (with state)

Adding Block-Level Syntax

Serializing the Renderer Instructions

Serializing the Parser

Testing

Bugs / Issues

Order of adding modes important

Change to Wordblock

Weakest Links

Greedy Tags

Footnote across list

Linefeed grabbing

Lists / Tables / Quote Issue

Footnotes and blocks

Headers

Header

Block / List Issue

TODO

More State to State Closing Instructions

Table / List / Quote sub modes

Discussion

Enhance the Parser with Tree Structure

Optimization of a List-Parsers