DocSearch Plugin
Compatible with DokuWiki
- 2022-07-31 "Igor" unknown
- 2020-07-29 "Hogfather" yes
- 2018-04-22 "Greebo" unknown
- 2017-02-19 "Frusterick Manners" unknown
Similar to elasticsearch
Needed for docsearchsitemap
This plugin allows you to search through your uploaded documents. It is integrated into the default DokuWiki search. Just fill in a search string and start to search.
A probably better alternative to this plugin is the elasticsearch plugin, which can also index documents.
Download and Installation
Search and install the plugin using the Extension Manager. Refer to Plugins on how to install plugins manually.
Changes
- Merge pull request #33 from dokuwiki-translate/lang_update_581_167770… (2023-03-01 21:58)
- translation update (2023-03-01 21:20)
- Merge pull request #30 from dokuwiki-translate/lang_update_680_151403… (2018-01-03 10:30)
- translation update (2017-12-23 13:35)
- Merge pull request #29 from dokuwiki-translate/lang_update_386 (2017-05-24 07:12)
- translation update (2017-05-23 22:30)
- Merge pull request #28 from dokuwiki-translate/lang_update_213 (2016-12-22 12:41)
- translation update (2016-12-22 05:50)
Cronjob
To create the search index you have to set up a cronjob (or a scheduled task under Windows) that runs dokuwiki/lib/plugins/docsearch/cron.php. You can also use an online cron job service such as https://www.easycron.com to trigger the script; a tutorial is available at https://www.easycron.com/cron-job-tutorials/how-to-set-up-cron-job-for-dokuwiki-docsearch.
The search just finds documents in the index. If you create the index, upload a new file and search for the file, you won't find it until you rebuild the index.
You may need to increase memory_limit in your PHP configuration. See ini.core.
Because docsearch runs cron.php as a CLI (command line) PHP script, you have to increase memory_limit in the php.ini used by the CLI, for example /etc/php5/cli/php.ini. Joachim 10.01.2011
Note: if you run a DokuWiki farm, you need to run the cronjob for each animal separately, passing the animal's name as the first parameter to the script.
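The cron setup described above can be sketched as a small helper that assembles the command line. This is only an illustration: the function name build_index_cmd, the paths and the 256M value are assumptions, not part of the plugin. The -d option is the standard PHP CLI way to override an ini setting for a single run, and the optional animal name follows the farm note above.

```bash
#!/bin/bash
# Sketch: build the command line a cronjob would run to rebuild the index.
# All paths below are placeholders, adapt them to your installation.
PHP_CMD=${PHP_CMD:-/usr/bin/php}
CRON_SCRIPT=${CRON_SCRIPT:-/var/www/dokuwiki/lib/plugins/docsearch/cron.php}

# build_index_cmd [memory_limit] [animal]
# Prints the command instead of running it, so it can be checked first.
build_index_cmd() {
    local memory_limit=$1 animal=$2
    local cmd=("$PHP_CMD")
    if [ -n "$memory_limit" ]; then
        # override memory_limit for this run only
        cmd+=(-d "memory_limit=$memory_limit")
    fi
    cmd+=("$CRON_SCRIPT")
    if [ -n "$animal" ]; then
        # farm setups: the animal name is the first script parameter
        cmd+=("$animal")
    fi
    printf '%s\n' "${cmd[*]}"
}

# Example crontab entry (every night at 02:30), shown as a comment only:
# 30 2 * * * /usr/bin/php -d memory_limit=256M /var/www/dokuwiki/lib/plugins/docsearch/cron.php
```

Printing the command first makes it easy to test it by hand on the command line before putting it into the crontab.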
Configuration
To configure the search you have to edit dokuwiki/lib/plugins/docsearch/conf/converter.php.
Use this file to set up the document to text converters.
The plugin tries to convert every media document to a text file. In this process it uses a given set of external tools, which are defined per file extension. The config stores one extension and its tool per line. You can use %in% and %out% as placeholders for the input and output file.
The abstract syntax is:
fileextension /path/to/converter -with_calls_to_convert --from inputfile --to outputfile
Example config for pdf, doc and odt:
#<?php die(); ?>
pdf /usr/bin/pdftotext -enc UTF-8 %in% %out%
doc /usr/bin/antiword %in% > %out%
odt /usr/bin/odt2txt %in% --output=%out%
The first line prevents users from viewing this file in a browser.
The second line gives the pdf extension followed by the path to the converter, with the two parameters %in% and %out%.
The third line covers doc documents. Antiword just prints the text to stdout, so we use > to redirect the text into a file.
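To make the line format and the placeholders more concrete, here is a minimal bash sketch of how such a config could be read and expanded. It is not the plugin's code (the plugin does this in PHP inside cron.php); the function names find_converter and expand_converter_cmd are hypothetical, and the parsing rules (lines starting with # are comments, the first word is the extension) are assumptions based on the example above.

```bash
#!/bin/bash
# find_converter <config file> <extension>
# Prints the command template defined for the extension, fails if there is none.
find_converter() {
    local conf=$1 wanted=$2 ext template
    while read -r ext template || [ -n "$ext" ]; do
        case $ext in
            ''|'#'*) continue ;;   # skip blank lines and comment lines
        esac
        if [ "$ext" = "$wanted" ]; then
            printf '%s\n' "$template"
            return 0
        fi
    done < "$conf"
    return 1
}

# expand_converter_cmd <template> <input file> <output file>
# Replaces %in% and %out% with shell-quoted file names and prints the result.
expand_converter_cmd() {
    local template=$1 quoted_in quoted_out cmd
    printf -v quoted_in '%q' "$2"
    printf -v quoted_out '%q' "$3"
    cmd=${template//%in%/"$quoted_in"}
    cmd=${cmd//%out%/"$quoted_out"}
    printf '%s\n' "$cmd"
}
```

For example, expanding the doc line from the config above with /tmp/a.doc and /tmp/a.txt prints /usr/bin/antiword /tmp/a.doc > /tmp/a.txt. Quoting the file names matters because uploaded documents often contain spaces in their names.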
You have to ensure that the output file is UTF-8 encoded. Otherwise you might get trouble with non-ASCII characters.
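One way to check the encoding of a converter's output is to let iconv read it as UTF-8. The following is a sketch under the assumption that iconv is installed; the function names is_utf8 and convert_to_utf8 are made up for this example and are not part of the plugin.

```bash
#!/bin/bash
# is_utf8 <file>: succeeds if the file content is valid UTF-8.
# iconv exits with an error as soon as it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# convert_to_utf8 <source charset> <input file> <output file>
# Re-encodes a text file whose charset is known (e.g. ISO-8859-1) to UTF-8.
convert_to_utf8() {
    iconv -f "$1" -t UTF-8 "$2" > "$3"
}
```

If a converter cannot write UTF-8 itself, a small wrapper script could call it first and then run convert_to_utf8 on the result before handing the file back to the indexer.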
Todo
- Allow the user to use the DokuWiki indexer to index the documents.
- Index documents that have been modified or changed only. → Skip already indexed documents for performance reasons.
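The second item could be approached by comparing modification times. The sketch below only illustrates that idea and does not describe current plugin behaviour; the function name needs_conversion and the notion of one text file per source document are assumptions.

```bash
#!/bin/bash
# needs_conversion <source document> <converted text file>
# Succeeds if the text file is missing or older than the source document.
needs_conversion() {
    local source_file=$1 text_file=$2
    [ ! -e "$text_file" ] || [ "$source_file" -nt "$text_file" ]
}
```

An indexer using such a check would skip every document for which needs_conversion fails and only run the external converter for new or changed files.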
Conversion settings
Office documents
Using jodconverter and OpenOffice.org
I would like to share some conversion settings which worked for me.
I am using jodconverter together with OpenOffice.org in headless mode and the following settings:
- converter.php
doc <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
docx <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
odt <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
The calc formats ods, xls and xlsx can be converted with a script that first converts them to .csv using jodconverter and then renames the result to .txt. Unfortunately only the first sheet gets converted when the output is csv. With PDF conversion all sheets including their names get converted (tested only for ods).
Unfortunately jodconverter does not convert ppt or pptx directly to txt. It would be possible to convert them to PDF first and run the pdftotext converter afterwards, but I don't like the overhead of such a chained solution. Are there any free command line tools out there to convert the mentioned formats on a Linux machine?
HINT: When using OpenOffice.org in headless mode, make sure you have enough memory. Otherwise it can crash and the indexing of all following documents will fail → jodconverter complains that it cannot connect to the OpenOffice.org server.
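Since a crashed OpenOffice.org instance makes every following conversion fail, it can help to test the connection before starting the index run. The sketch below uses bash's /dev/tcp redirection for that. The function name office_is_listening is hypothetical, and host 127.0.0.1 with port 8100 is an assumption (8100 is the port jodconverter 2.x commonly uses); adjust both to your setup.

```bash
#!/bin/bash
# office_is_listening [host] [port]
# Succeeds if a TCP connection to the OpenOffice.org service can be opened.
# Uses bash's built-in /dev/tcp redirection, so no extra tools are needed.
office_is_listening() {
    local host=${1:-127.0.0.1} port=${2:-8100}
    # the subshell closes the connection again as soon as it exits
    (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

# Possible use in a cron wrapper (comment only, paths are placeholders):
# office_is_listening || { echo "OpenOffice.org is not reachable" >&2; exit 1; }
```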
Using jodconverter, OpenOffice.org and a script
- converter.php
ppt <path to office2txt.sh>/office2txt.sh %in% %out%
pptx <path to office2txt.sh>/office2txt.sh %in% %out%
odp <path to office2txt.sh>/office2txt.sh %in% %out%
xls <path to office2txt.sh>/office2txt.sh %in% %out%
xlsx <path to office2txt.sh>/office2txt.sh %in% %out%
ods <path to office2txt.sh>/office2txt.sh %in% %out%
Here is the bash script I am using to do a chained conversion because jodconverter can not convert them directly to txt files. First to PDF and then to txt. Comments welcome since I am no bash guru…
- office2txt.sh
#!/bin/bash
# Converter script to convert almost everything OpenOffice can read to txt
# using jodconverter and the pdftotext tool.
# Because jodconverter cannot convert file formats like ppt, pptx, xls, ods, xlsx
# to txt directly, a conversion to PDF is performed first using jodconverter.
# The second step is a conversion from PDF to txt using the pdftotext command line tool.
# usage: office2txt.sh <inputfile> <outputfile>
# <inputfile> is an arbitrary file OpenOffice can read (with correct file extension!)
# <outputfile> is the filename the result should go to (txt as file extension)
#
# adapt the settings below to your own needs

echo "Input: $1"

#jodconverter jar file
JODCONVERTER_CMD=/opt/jodconverter/lib/jodconverter-cli-2.2.2.jar
#pdftotext binary (find out your path using the 'which pdftotext' cmd)
PDF2TXT_CMD=/usr/bin/pdftotext
#your java cmd
JAVA_CMD=/usr/bin/java
#temporary folder for storing the PDF (path without trailing /) (you need to have write access here!)
TMP_FOLDER=/tmp/pdftmp

#make sure the temporary folder exists
mkdir -p "$TMP_FOLDER"

#extract input name
input_fullfile=$1
input_filename_w_ext=$(basename "$input_fullfile")
input_extension=${input_filename_w_ext##*.}
input_filename_wo_ext=${input_filename_w_ext%.*}

#first conversion to PDF:
tmpfile="$TMP_FOLDER/$input_filename_wo_ext.pdf"
"$JAVA_CMD" -jar "$JODCONVERTER_CMD" "$input_fullfile" "$tmpfile"

#second conversion to txt:
"$PDF2TXT_CMD" "$tmpfile" "$2"

#remove tmp file
rm -f "$tmpfile"
An alternative to OpenOffice.org is Apache Tika (http://tika.apache.org/). Example:
/usr/bin/java -jar /path/to/apache-tika/tika-app-x.xx.jar --text %in% > %out%
Mindmaps from FreeMind
Using xsl transformation
To convert files generated by FreeMind (.mm) to text, one can use an XSLT transformation with an xsl document provided by FreeMind (I took mm2csv.xsl from FreeMind 0.9beta, which worked well on files generated with 0.8.1).
- converter.php
mm <path to converter script>/mm2txt.sh %in% %out%
Here is the little script which uses xmlstarlet to apply the xsl document to the FreeMind file
- mm2txt.sh
#!/bin/bash
# Converter script to convert mindmaps generated by FreeMind to txt
# The conversion is done by an xsl definition and the command line tool xmlstarlet
# The used xsl file "mm2csv.xsl" can be found inside the FreeMind 0.9 (beta) archive
# in the folder accessories, which can be downloaded at http://freemind.sourceforge.net

#Full path to the xsl file
XSL_FILE=/opt/mm2csv.xsl
#Full path to the command line converter xmlstarlet
XML_STARLET=/usr/bin/xmlstarlet

#conversion
"$XML_STARLET" tr "$XSL_FILE" "$1" > "$2"
ZIP Files
For zip files this little script can be used. The command line tools for the conversion need to be added for each document type. The known document types get extracted to a temp folder where they are converted to txt and joined to one big text file which can be indexed.
Currently only conversion tools are supported which have the following style: <cmd> <inputfile> <outputfile>
- zip2txt.sh
#!/bin/bash
# This is a converter script to convert the content of a zip file to a single txt file.
# All files whose extensions are defined in this script get unzipped, converted to text
# and joined into one single output file
# usage: zip2txt.sh <inputfile> <outputfile>

#adapt this:
#Folder where the zip file is unpacked
#WARNING: DO NOT USE THIS FOLDER FOR ANYTHING ELSE -> all files in there will be converted!
TMPFOLDER="/tmp/zipconverter"
#File which is used as a temporary storage
#DO NOT PLACE THE TMPFILE INSIDE/BELOW THE TMPFOLDER IF YOU DON'T EXACTLY KNOW WHAT YOU ARE DOING
TMPFILE="/tmp/zipconversion.txt"

#commands needed for this script
UNZIP_CMD="/usr/bin/unzip"
FIND_CMD="/usr/bin/find"

#extend the extension and command arrays for your personal needs
#note: the first parameter of the cmd must be the input, the second is the output filename,
#e.g. /opt/office2txt.sh <inputfile> <outputfile>
FILEEXT[0]="doc"; CMD[0]="/opt/office2txt.sh"
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"

#IO definitions
zipfile=$1
outputfile=$2

#generate filter patterns from FILEEXT
filter=()
for ext in "${FILEEXT[@]}"
do
    filter+=("*.$ext")
done

#Unzip only content with known extensions into TMPFOLDER, ignoring case sensitivity of filter ("-C").
#The "-P \n" is needed to tell unzip that we do not have a valid password so it does not ask on stdin
#if a file is encrypted
"$UNZIP_CMD" -o -qq -C -P \n "$zipfile" "${filter[@]}" -d "$TMPFOLDER"

#put the names of all files inside the TMPFOLDER into an array.
#Whitespace in filenames is handled correctly (from http://mywiki.wooledge.org/BashFAQ/020)
unset filenames i
while IFS= read -r -d '' file; do
    filenames[i++]=$file
done < <("$FIND_CMD" "$TMPFOLDER" -type f -print0)

#switch off case sensitivity
shopt -s nocasematch

#convert each file to txt according to the command set in CMD
for file in "${filenames[@]}"
do
    echo "Working on file: $file"
    #get file extension
    input_filename_w_ext=$(basename "$file")
    input_extension=${input_filename_w_ext##*.}
    #search extension in FILEEXT array (case insensitive)
    tLen=${#FILEEXT[@]}
    for (( i=0; i<tLen; i++ ))
    do
        if [[ ${FILEEXT[$i]} = "$input_extension" ]]
        then
            rm -f "$TMPFILE" #make sure it is empty
            #execute conversion cmd (CMD is left unquoted so it may carry options)
            echo ${CMD[$i]} "$file" "$TMPFILE"
            ${CMD[$i]} "$file" "$TMPFILE"
            #append $TMPFILE to output file $outputfile
            cat "$TMPFILE" >> "$outputfile"
            break
        fi
    done
done

#switch on case sensitivity
shopt -u nocasematch

#remove all stuff in the temp folder and the temp file
rm -rf "${TMPFOLDER:?}"/*
rm -f "$TMPFILE"
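Tools that print to stdout, such as antiword in the configuration section, do not fit the required <cmd> <inputfile> <outputfile> style directly. A thin wrapper can adapt them. The following is a sketch only; the function name run_stdout_converter and the script name in the comment are hypothetical.

```bash
#!/bin/bash
# run_stdout_converter <converter> <inputfile> <outputfile>
# Runs a converter that writes its result to stdout and stores the text
# in <outputfile>, which gives it the <cmd> <inputfile> <outputfile> style.
run_stdout_converter() {
    local converter=$1 input_file=$2 output_file=$3
    "$converter" "$input_file" > "$output_file"
}

# As a stand-alone script (for example /opt/antiword2txt.sh, a made-up name)
# the body would simply be:
#   /usr/bin/antiword "$1" > "$2"
# and the script path could then be entered in the CMD array of zip2txt.sh.
```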
WARNING: Because this script joins all content found in the zip file into one huge text file, the indexing process (PHP) will need a lot of memory! It is a good idea to write the output of this conversion script to a logfile and check it regularly for errors. To increase the memory limit, have a look at the tips at the top of the page. I had to set the memory limit of PHP to 250 MB because the txt file generated by this script was 8.8 MB in size. This can happen very easily if a zip file contains a lot of PDF documents!
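To notice oversized text files before the indexer runs out of memory, the conversion script could check the size of its output. This is a minimal sketch; the function names and the idea of a fixed byte limit are assumptions, and a suitable limit depends on your memory_limit setting.

```bash
#!/bin/bash
# text_size_bytes <file>: print the size of a file in bytes.
text_size_bytes() {
    wc -c < "$1" | tr -d '[:space:]'
}

# exceeds_size_limit <file> <limit in bytes>
# Succeeds if the file is larger than the given limit.
exceeds_size_limit() {
    local size
    size=$(text_size_bytes "$1")
    [ "$size" -gt "$2" ]
}

# Possible use at the end of zip2txt.sh (comment only, 5 MB limit as an example):
# exceeds_size_limit "$outputfile" 5242880 && echo "WARNING: $outputfile is very large" >&2
```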
Installation in Windows 2003
I run DokuWiki for our company's intranet on a Windows 2003 server with XAMPP. The following short description explains how I got docsearch to run on this system.
Converters
For me, the following converters worked:
- PDF: pdftotext, which you can find here http://www.foolabs.com/xpdf/download.html
- Office documents: catdoc, xls2csv and catppt, which you can find here http://blog.brush.co.nz/2009/09/catdoc-windows/ The conversion for DOT, XLT and the newer Office formats is not perfect but the quality, in my opinion, is sufficient as an input for docsearch.
Install them in an appropriate location, e.g. C:\TOOLS, and adjust the converter.php file (replace [PATH TO] with your actual path, e.g. C:\TOOLS\XPDF and C:\TOOLS\CATDOC).
- converter.php
pdf [PATH TO]\pdftotext.exe %in% %out%
doc [PATH TO]\catdoc.exe -a -s utf-8 -d utf-8 %in% > %out%
xls [PATH TO]\xls2csv.exe -s utf-8 -d utf-8 %in% > %out%
ppt [PATH TO]\catppt.exe -s utf-8 -d utf-8 %in% > %out%
docx [PATH TO]\catdoc.exe -a -s utf-8 -d utf-8 %in% > %out%
xlsx [PATH TO]\xls2csv.exe -s utf-8 -d utf-8 %in% > %out%
pptx [PATH TO]\catppt.exe -s utf-8 -d utf-8 %in% > %out%
xlt [PATH TO]\xls2csv.exe -s utf-8 -d utf-8 %in% > %out%
dot [PATH TO]\catdoc.exe -a -s utf-8 -d utf-8 %in% > %out%
For catdoc, -a -s utf-8 -d utf-8 means
- -a: output is in ASCII format
- -s utf-8: source (input) is in UTF-8 Unicode format (which should be the case for Office documents)
- -d utf-8: destination (output) is in UTF-8 Unicode format
Cronjob
Instead of a cronjob, set up a scheduled task in Windows to index new files. To do this, go to Start→Programs→Accessories→System Tools→Scheduled Tasks, set up a new task and enter the following into the “run” field: “[PATH TO]\php.exe [PATH TO]\cron.php” (the cron.php file is in a subdirectory of the docsearch plugin).
Example: The “run” field should contain something like this:
"C:\Program Files\php\php.exe" C:\Website\dokuwiki\lib\plugins\docsearch\cron.php
or with XAMPP as a server environment
C:\xampp\php\php.exe C:\xampp\htdocs\dokuwiki\lib\plugins\docsearch\cron.php
The rest of the setup should be straightforward.
Issues
You may get a “file not found” or “path not found” error from cron.php when using some utilities or command line expressions in the converter.php file. This happens because the slashes in the paths are not converted to DOS/Windows format.
To fix this, insert the following code around line 87 in cron.php, after the line $cmd = str_replace('%out%', escapeshellarg($out), $cmd);
if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
    $cmd = str_replace('/', '\\', $cmd);
}