DocSearch Plugin

Compatible with DokuWiki

  • 2024-02-06 "Kaos" unknown
  • 2023-04-04 "Jack Jackrum" unknown
  • 2022-07-31 "Igor" unknown
  • 2020-07-29 "Hogfather" yes

Search through your uploaded documents.

Similar to: elasticsearch, searchtext

Tagged with: search

Needed for: docsearchsitemap

This plugin allows you to search through your uploaded documents. It is integrated into the default DokuWiki search: just enter a search string and start searching.

:!: A probably better alternative to this plugin is the elasticsearch plugin, with its ability to index documents.

A CosmoCode Plugin

Download and Installation

Search and install the plugin using the Extension Manager. Refer to Plugins on how to install plugins manually.



To create the search index you have to set up a cronjob (or a scheduled task under Windows) that runs dokuwiki/lib/plugins/docsearch/cron.php. You can also use an online cron job service to trigger the script.
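As a sketch, a crontab entry for this could look like the following (the installation path and the half-hourly schedule are assumptions; adapt them to your setup):

```
# run the docsearch indexer every 30 minutes (assumed path)
*/30 * * * * php /var/www/dokuwiki/lib/plugins/docsearch/cron.php
```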

The search only finds documents that are in the index. If you create the index, then upload a new file and search for it, you won't find it until you rebuild the index.

It is possible that you need to increase the memory_limit in your PHP configuration. See ini.core

:!: Because docsearch runs cron.php as a CLI PHP (command line) script, you have to increase memory_limit in the CLI configuration, e.g. /etc/php5/cli/php.ini. (Joachim, 10.01.2011)
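For example, the relevant setting in the CLI php.ini might look like this (the 256M value is an assumption; pick a limit that matches the size of your documents):

```ini
; in the CLI php.ini, e.g. /etc/php5/cli/php.ini
memory_limit = 256M
```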

Note: if you run a DokuWiki farm, you need to run the cronjob for each animal separately, passing the animal's name as the first parameter to the script.
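A minimal sketch of such a farm cronjob, looping over the animals (the installation path and the animal names are assumptions; the command is only echoed here as a dry run):

```shell
# dry run: print the indexer invocation for each animal in the farm
DOKU=/var/www/dokuwiki    # assumed DokuWiki installation path
ANIMALS="wiki1 wiki2"     # assumed animal names
for animal in $ANIMALS; do
  echo "php $DOKU/lib/plugins/docsearch/cron.php $animal"
done
```

In a real cronjob you would drop the echo so the indexer actually runs once per animal.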


To configure the search you have to edit dokuwiki/lib/plugins/docsearch/conf/converter.php.

Use this file to set up the document-to-text converters.

The plugin tries to convert every media document to a text file. In this process it uses a given set of external tools. These tools are defined per file extension: the config stores one extension and its tool per line. You can use %in% and %out% as placeholders for the input and output file.

The abstract syntax is:

fileextension /path/to/converter -with_calls_to_convert --from inputfile --to outputfile
:!: You can use the ConfManager plugin to edit the config.

Example config for pdf, doc and odt:

#<?php die(); ?>
pdf /usr/bin/pdftotext -enc UTF-8 %in% %out%
doc /usr/bin/antiword %in% > %out%
odt /usr/bin/odt2txt %in% --output=%out%

The first line prevents users from viewing this file in a browser.

The second line maps the pdf extension to the converter, called with the two placeholders %in% and %out%.

The third line covers doc documents. Antiword just prints the text to stdout, so we use > to redirect the text into a file.

You have to ensure that the output file is UTF-8 encoded. Otherwise you might run into trouble with non-ASCII characters.
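If a converter emits another encoding, you can pipe its output through iconv before redirecting it to %out%. A sketch (the ISO-8859-1 source encoding is an assumption; substitute whatever your converter actually produces):

```shell
# pipe Latin-1 converter output through iconv to get UTF-8
# (\351 is the octal escape for the Latin-1 byte of "é")
printf 'caf\351\n' | iconv -f ISO-8859-1 -t UTF-8
# prints: café
```

Since the doc example above already uses shell redirection (> %out%), a pipe through iconv in the config line should work the same way.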


Wishlist

  • Allow the user to use the DokuWiki indexer to index the documents.
  • Only index documents that have been added or modified, skipping already indexed documents for performance reasons.

Conversion settings

Office documents

Using jodconverter and OpenOffice

I would like to share some conversion settings which worked for me.

I am using jodconverter together with OpenOffice in headless mode and the following settings:

doc <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
docx <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
odt <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%

The Calc formats ods, xls and xlsx can be handled by scripts that first convert them to .csv using jodconverter and then rename the result to .txt. Unfortunately only the first spreadsheet gets converted when the output is csv. With PDF conversion all spreadsheets, including their names, get converted (tested only for ods).

Unfortunately the jodconverter does not convert ppt or pptx directly to txt. It would be possible to convert them first to a PDF and run the pdftotext converter afterwards, but I don't like the overhead of such a chained solution. Are there any free command line tools out there to convert the mentioned formats on a Linux machine?

HINT: When using OpenOffice in headless mode, make sure you have enough memory. Otherwise it can crash and the indexing of all following documents will fail (jodconverter complains that it cannot connect to the server).

Using jodconverter and a script

ppt                             <path to>/ %in% %out%
pptx                            <path to>/ %in% %out%
odp                             <path to>/ %in% %out%
xls                             <path to>/ %in% %out%
xlsx                            <path to>/ %in% %out%
ods                             <path to>/ %in% %out%

Here is the bash script I am using to do a chained conversion, because jodconverter cannot convert these formats directly to txt files: first to PDF, then to txt. Comments welcome, since I am no bash guru… (The command paths below are placeholders; adapt them to your system.)

#!/bin/bash
# Converter script to convert almost everything OpenOffice can read to txt using the jodconverter
# and the pdftotext tool
# Because the jodconverter can not convert file formats like ppt, pptx, xls, ods, xlsx to txt directly,
# a conversion to PDF is performed first using the jodconverter. The second step is a conversion from
# PDF to txt using the pdftotext command line tool
# usage: <inputfile> <outputfile>
# <inputfile> is an arbitrary file OpenOffice can read (with correct file extension!)
# <outputfile> is the filename the result should go to (txt as file extension)
# adapt the settings below to your own needs
input_fullfile="$1"
echo "Input: $1"
#jodconverter binary cmd
JODCONVERTER_CMD=/path/to/jodconverter-cli-2.2.2.jar
#pdf2txt binary cmd (find out your path using the 'which pdftotext' cmd)
PDF2TXT_CMD=/usr/bin/pdftotext
#your java cmd
JAVA_CMD=/usr/bin/java
#temporary folder for storing the PDF (path without trailing /)(you need to have write access here!)
TMP_FOLDER=/tmp
#extract input name with and without extension
input_filename_w_ext=$(basename "$input_fullfile")
input_filename_wo_ext="${input_filename_w_ext%.*}"
tmpfile="$TMP_FOLDER/$input_filename_wo_ext.pdf"
#first conversion to PDF:
$JAVA_CMD -jar "$JODCONVERTER_CMD" "$input_fullfile" "$tmpfile"
#second conversion to txt:
$PDF2TXT_CMD "$tmpfile" "$2"
#remove tmp file
rm -f "$tmpfile"

An alternative to OpenOffice is Apache Tika. Example:

/usr/bin/java -jar /path/to/apache-tika/tika-app-x.xx.jar --text %in% > %out%

Mindmaps from FreeMind

Using xsl transformation

To convert files generated by FreeMind (.mm) to text, one can use an XSLT transformation with an XSL document provided by FreeMind (I took mm2csv.xsl from FreeMind 0.9beta, which worked well on files generated with 0.8.1).

mm                              <path to converter script>/ %in% %out%

Here is the little script which uses xmlstarlet to apply the XSL document to the FreeMind file (the paths below are placeholders; adapt them):

#!/bin/bash
# Converter script to convert mindmaps generated by FreeMind to txt
# The conversion is done by an XSL definition and the command line tool xmlstarlet
# The used xsl file "mm2csv.xsl" can be found in the accessories folder
# inside the FreeMind 0.9 (beta) archive
# usage: <inputfile> <outputfile>
#Full path to the xsl file
XSL_FILE=/path/to/mm2csv.xsl
#Full path to the commandline converter xmlstarlet
XMLSTARLET_CMD=/usr/bin/xmlstarlet
$XMLSTARLET_CMD tr "$XSL_FILE" "$1" > "$2"

ZIP Files

For zip files this little script can be used. The command line tools for the conversion need to be added for each document type. The known document types get extracted to a temp folder, where they are converted to txt and joined into one big text file which can then be indexed.

Currently only conversion tools with the following call style are supported: <cmd> <inputfile> <outputfile>
#!/bin/bash
# This is a converter script to convert the content from a zip file to a single txt file.
# All files whose extensions are defined in this script get unzipped, converted to text and joined into one single output file
# usage: <inputfile> <outputfile>
#adapt this (the paths below are placeholders):
#Folder where the zip file is unpacked WARNING: DO NOT USE THIS FOLDER FOR ANYTHING ELSE -> all files in there will be converted!
TMPFOLDER=/tmp/docsearch_zip
#File which is used as a temporary storage
TMPFILE=/tmp/docsearch_zip.txt
#commands needed for this script
UNZIP_CMD=/usr/bin/unzip
FIND_CMD=/usr/bin/find
#extend the extension and command array for your personal needs
#note: the first parameter of the cmd must be the input, the second is the output filename. e.g. /opt/<converter> <inputfile> <outputfile>
FILEEXT[0]="doc"; CMD[0]="/opt/"   #the original path was truncated here; point this at your doc-to-text converter
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"
#IO definitions
outputfile="$2"
mkdir -p "$TMPFOLDER"
#generate filter string from FILEEXT
filter=""
for ext in "${FILEEXT[@]}"; do
  filter="$filter *.$ext"
done
#Unzip only content with known extensions into TMPFOLDER, ignoring case sensitivity of the filter ("-C").
# The "-P \n" is needed to tell unzip that we do not have a valid password so it does not ask on stdin
# if a file is encrypted
$UNZIP_CMD -o -qq -C -P \n "$1" $filter -d "$TMPFOLDER"
#put all filenames which are inside the TMPFOLDER into an array.
#Whitespace in filenames is handled correctly
unset filenames i
while IFS= read -r -d '' file; do
  filenames[i++]="$file"
#  echo "File: ${filenames[i-1]}"
done < <($FIND_CMD "$TMPFOLDER" -type f -print0)
#switch off case sensitivity
shopt -s nocasematch
#convert each file to txt according to the command set in CMD
for file in "${filenames[@]}"; do
  echo "Working on file: $file"
  #get file extension
  input_filename_w_ext=$(basename "$file")
  input_extension="${input_filename_w_ext##*.}"
  #search extension in FILEEXT array (case insensitive)
  # get length of the array
  tLen=${#FILEEXT[@]}
  for (( i=0; i<tLen; i++ )); do
    if [[ ${FILEEXT[$i]} = $input_extension ]]; then
      rm -f "$TMPFILE" #make sure it is empty
      #execute conversion cmd
      echo ${CMD[$i]} "$file" "$TMPFILE"
      ${CMD[$i]} "$file" "$TMPFILE"
      #append $TMPFILE to output file $outputfile
      cat "$TMPFILE" >> "$outputfile"
    fi
  done
done
#switch on case sensitivity
shopt -u nocasematch
#remove all stuff in the temp folder and the temp file
rm -rf "${TMPFOLDER:?}"/*
rm -f "$TMPFILE"

WARNING: Because this script joins all content found in the zip file into one huge text file, the indexing process (PHP) will need a lot of memory! You had better redirect the output of this conversion script to a logfile and check it regularly for errors! To increase the memory limit, have a look at the tips at the top of the page. I had to set the PHP memory limit to 250 MB because the text file generated by this script was 8.8 MByte in size. This can happen very easily if a zip file contains a lot of PDF documents!
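As a sketch, the cronjob could redirect the indexer's output to a logfile like this (both paths are assumptions):

```
php /var/www/dokuwiki/lib/plugins/docsearch/cron.php >> /var/log/docsearch.log 2>&1
```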

Installation in Windows 2003

I run Dokuwiki for our company's intranet on a Windows 2003 server with XAMPP. The following short description explains how I got docsearch to run on this system.


For me, the following converters worked:

Install them in an appropriate location, e.g. C:\TOOLS, and adjust the converter.php file (replace [PATH TO] with your actual path, e.g. C:\TOOLS\XPDF and C:\TOOLS\CATDOC).

pdf    [PATH TO]\pdftotext.exe %in% %out%
doc    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xls    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
ppt    [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
docx    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xlsx    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
pptx    [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
xlt    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
dot    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%

For catdoc, -a -s koi8-u -d koi8-u means

  • -a: output in ASCII format
  • -s koi8-u: treat KOI8-U as the source (input) character set
  • -d koi8-u: use KOI8-U as the destination (output) character set

Note that KOI8-U is a Cyrillic encoding; adjust these charsets to match your documents, and keep in mind that the output file should end up UTF-8 encoded for the index (see above).


Instead of a cronjob, set up a scheduled task in Windows to index new files. To do this, go to Start→Programs→Accessories→System Tools→Scheduled Tasks, set up a new task and enter the following into the “run” field: “[PATH TO]\php.exe [PATH TO]\cron.php” (the cron.php file is in a subdirectory of the docsearch plugin).

Example: The “run” field should contain something like this:

C:\Program Files\php\php.exe C:\Website\dokuwiki\lib\plugins\docsearch\cron.php

or with XAMPP as a server environment

C:\xampp\php\php.exe C:\xampp\htdocs\dokuwiki\lib\plugins\docsearch\cron.php

The rest of the setup should be straightforward.


You may get a “file not found” or “path not found” error from cron.php when using some utilities or command line expressions in the converter.php file. This is because the path slashes are not in DOS/Windows format.

To fix this, insert this code around line 87 in cron.php, after $cmd = str_replace('%out%', escapeshellarg($out), $cmd);:

if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
    $cmd = str_replace('/', '\\', $cmd);
}
plugin/docsearch.txt · Last modified: 2024-01-08 17:05 by Aleksandr

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International