DocSearch Plugin

Compatible with DokuWiki

  • 2024-02-06 "Kaos" unknown
  • 2023-04-04 "Jack Jackrum" unknown
  • 2022-07-31 "Igor" unknown
  • 2020-07-29 "Hogfather" yes

Search through your uploaded documents

Last updated on: 2016-07-18
Provides: Action
Repository: Source
Conflicts with: searchstats
Similar to: elasticsearch, searchtext
Tagged with: search
Needed for: docsearchsitemap

This plugin allows you to search through your uploaded documents. It is integrated into the default DokuWiki search. Just fill in a search string and start to search.

:!: A probably better alternative to this plugin is the elasticsearch Plugin, which is able to index documents as well.

A CosmoCode Plugin

Download and Installation

Search and install the plugin using the Extension Manager. Refer to Plugins on how to install plugins manually.

Cronjob

To create the search index you have to set up a cronjob (or a scheduled task under Windows) that runs dokuwiki/lib/plugins/docsearch/cron.php. Alternatively, an online cron job service such as https://www.easycron.com can trigger the script; a tutorial is available at https://www.easycron.com/cron-job-tutorials/how-to-set-up-cron-job-for-dokuwiki-docsearch.

The search just finds documents in the index. If you create the index, upload a new file and search for the file, you won't find it until you rebuild the index.

You may need to increase memory_limit in your PHP configuration. See ini.core.

:!: Because docsearch runs cron.php as a CLI (command line) PHP script, you have to increase memory_limit in the php.ini used by the CLI, for example /etc/php5/cli/php.ini. Joachim 10.01.2011

Note: if you run a DokuWiki farm, you need to run the cronjob for each animal separately, passing the animal's name as first parameter to the script.
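
The notes above can be combined in a small wrapper that the cronjob calls. This is only a sketch: the php binary name, the DokuWiki path and the 256M limit are assumptions, and run_docsearch_index is a made-up helper name. It uses the -d switch of the PHP command line, which sets an ini value for a single run, so the CLI php.ini does not have to be edited. With DRY_RUN=1 the command is only printed.

```shell
#!/bin/bash
# Sketch of a cron wrapper for docsearch. All paths and values are assumptions.
PHP_CMD=${PHP_CMD:-php}
CRON_PHP=${CRON_PHP:-/var/www/dokuwiki/lib/plugins/docsearch/cron.php}
MEMORY_LIMIT=${MEMORY_LIMIT:-256M}

# Build the index once; an optional animal name is passed as first parameter.
run_docsearch_index() {
  local cmd=("$PHP_CMD" -d "memory_limit=$MEMORY_LIMIT" "$CRON_PHP")
  if [ -n "$1" ]; then
    cmd+=("$1")
  fi
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # only show what would be executed
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For a single wiki, call run_docsearch_index without arguments. For a farm, call it once per animal, e.g. for animal in animal1 animal2; do run_docsearch_index "$animal"; done (the animal names are placeholders).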

Configuration

To configure the search you have to edit the dokuwiki/lib/plugins/docsearch/conf/converter.php.

Use this file to set up the document-to-text converters.

The plugin tries to convert every media document to a text file. In this process it uses a given set of external tools. These tools are defined per file extension: the config stores one extension and its tool per line. You can use %in% and %out% for the input and output file.

The abstract syntax is:

fileextension /path/to/converter -with_calls_to_convert --from inputfile --to outputfile
:!: You can use the ConfManager Plugin to edit the config.
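
To see what a config line turns into, the placeholder substitution can be imitated in bash. This is a sketch and not the plugin's code: expand_converter_line is a hypothetical helper. It assumes, based on the cron.php line quoted in the Issues section at the end of this page, that %in% and %out% are replaced by shell-escaped file names.

```shell
#!/bin/bash
# Sketch only: mimics how a converter.php line could be expanded.
# expand_converter_line is a hypothetical helper, not part of the plugin.
expand_converter_line() {
  local line=$1 infile=$2 outfile=$3 ext cmd q_in q_out
  # the first word is the file extension, the rest is the command template
  read -r ext cmd <<< "$line"
  # shell-quote both file names so that blanks in names survive
  printf -v q_in '%q' "$infile"
  printf -v q_out '%q' "$outfile"
  cmd=${cmd//'%in%'/"$q_in"}
  cmd=${cmd//'%out%'/"$q_out"}
  printf '%s\n' "$cmd"
}
```

Calling expand_converter_line 'pdf /usr/bin/pdftotext -enc UTF-8 %in% %out%' /tmp/a.pdf /tmp/a.txt prints the command with both placeholders filled in, which is handy for testing a converter by hand before the cronjob runs it.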

Example config for pdf, doc and odt:

#<?php die(); ?>
pdf /usr/bin/pdftotext -enc UTF-8 %in% %out%
doc /usr/bin/antiword %in% > %out%
odt /usr/bin/odt2txt %in% --output=%out%

The first line prevents users from viewing this file in a browser.

The second line gives the pdf extension, followed by the path to the converter with the two parameters %in% and %out%.

The third line covers doc documents. Antiword just prints the text to stdout, so we use > to redirect the text into a file.

The fourth line covers odt documents; odt2txt receives the output file through its --output option.

You have to ensure that the output file is UTF-8 encoded. Otherwise you may run into trouble with non-ASCII characters.
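
One way to check the encoding of a converter's output by hand is to let iconv read the file as UTF-8. This is a sketch under the assumption that the iconv command line tool is installed; is_utf8 is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: report whether a file is valid UTF-8.
# is_utf8 is a hypothetical helper; it relies on the iconv command line tool.
is_utf8() {
  # iconv exits with a non-zero status when it meets an invalid byte sequence
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

Run the converter on a test document and then call is_utf8 /tmp/test.txt || echo "not UTF-8" (the file name is a placeholder).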

Todo

  • Allow the user to use the DokuWiki indexer to index the documents.
  • Index documents that have been modified or changed only. → Skip already indexed documents for performance reasons.

Conversion settings

Office documents

Using jodconverter and OpenOffice.org

I would like to share some conversion settings which worked for me.

I am using the jodconverter together with OpenOffice.org in headless mode and the following settings:

converter.php
doc <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
docx <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
odt <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%

The calc formats ods, xls and xlsx can be handled by a script that first converts them to .csv with jodconverter and then renames the result to .txt. Unfortunately only the first spreadsheet gets converted when the output is csv. With PDF conversion all spreadsheets including their names get converted (tested only for ods).

Unfortunately the jodconverter does not convert ppt or pptx directly to txt. It would be possible to convert them to PDF first and run the pdftotext converter afterwards, but I don't like the overhead of such a chained solution. Are there any free command line tools out there to convert the mentioned formats on a Linux machine?

HINT: When using OpenOffice.org in headless mode, make sure you have enough memory. Otherwise it can crash and the indexing of all following documents will fail: jodconverter then complains that it cannot connect to the OpenOffice.org server.
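
Because a crashed OpenOffice.org makes every following conversion fail, it can help to test the connection before the index run starts. The sketch below is an assumption-laden example: office_is_listening is a hypothetical helper, and host 127.0.0.1 with port 8100 is, to my knowledge, what jodconverter 2.x connects to by default; adapt both if your setup differs.

```shell
#!/bin/bash
# Sketch: test whether something accepts TCP connections on the OpenOffice.org port.
# Host and port are assumptions (127.0.0.1:8100); adapt them to your setup.
office_is_listening() {
  local host=${1:-127.0.0.1} port=${2:-8100}
  # bash's /dev/tcp pseudo device opens a TCP connection;
  # the subshell closes it again as soon as it exits
  (exec 3<>"/dev/tcp/$host/$port") 2> /dev/null
}
```

A cron wrapper could then do office_is_listening || { echo "OpenOffice.org not reachable" >&2; exit 1; } before calling cron.php.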

Using jodconverter, OpenOffice.org and a script

converter.php
ppt                             <path to office2txt.sh>/office2txt.sh %in% %out%
pptx                            <path to office2txt.sh>/office2txt.sh %in% %out%
odp                             <path to office2txt.sh>/office2txt.sh %in% %out%
xls                             <path to office2txt.sh>/office2txt.sh %in% %out%
xlsx                            <path to office2txt.sh>/office2txt.sh %in% %out%
ods                             <path to office2txt.sh>/office2txt.sh %in% %out%

Here is the bash script I am using to do a chained conversion because jodconverter can not convert them directly to txt files. First to PDF and then to txt. Comments welcome since I am no bash guru…

office2txt.sh
#!/bin/bash
# Converter script to convert almost everything OpenOffice can read to txt using the jodconverter
# and the pdftotext tool.
# Because the jodconverter can not convert file formats like ppt, pptx, xls, ods, xlsx to txt directly,
# a conversion to PDF is performed first using the jodconverter. The second step is a conversion from
# PDF to txt using the pdftotext command line tool.
# usage: office2txt.sh <inputfile> <outputfile>
# <inputfile> is an arbitrary file OpenOffice can read (with correct file extension!)
# <outputfile> is the filename the result should go to (txt as file extension)
#
# adapt the settings below to your own needs

echo "Input: $1"

#jodconverter jar file
JODCONVERTER_CMD=/opt/jodconverter/lib/jodconverter-cli-2.2.2.jar
#pdftotext binary cmd (find out your path using the 'which pdftotext' cmd)
PDF2TXT_CMD=/usr/bin/pdftotext
#your java cmd
JAVA_CMD=/usr/bin/java
#temporary folder for storing the PDF (path without trailing /)(you need to have write access here!)
TMP_FOLDER=/tmp/pdftmp

#make sure the temporary folder exists
mkdir -p "$TMP_FOLDER"

#extract input name
input_fullfile=$1
input_filename_w_ext=$(basename "$input_fullfile")
input_extension=${input_filename_w_ext##*.}
input_filename_wo_ext=${input_filename_w_ext%.*}

#first conversion to PDF:
tmpfile="$TMP_FOLDER/$input_filename_wo_ext.pdf"
"$JAVA_CMD" -jar "$JODCONVERTER_CMD" "$input_fullfile" "$tmpfile"

#second conversion to txt:
"$PDF2TXT_CMD" "$tmpfile" "$2"

#remove tmp file (quoted, so file names with blanks are handled)
rm -f "$tmpfile"

An alternative to OpenOffice.org is Apache Tika (http://tika.apache.org/). Example:

/usr/bin/java -jar /path/to/apache-tika/tika-app-x.xx.jar --text %in% > %out%

Mindmaps from FreeMind

Using xsl transformation

To convert files generated by FreeMind (.mm) to text, one can use an xslt transformation with an xsl document provided by FreeMind (I took mm2csv.xsl from FreeMind 0.9beta, which worked well on files generated with 0.8.1).

converter.php
mm                              <path to converter script>/mm2txt.sh %in% %out%

Here is the little script which uses xmlstarlet to apply the xsl document to the FreeMind file:

mm2txt.sh
#!/bin/bash
# Converter script to convert mindmaps generated by FreeMind to txt
# The conversion is done by an xsl definition and the command line tool xmlstarlet
# The used xsl file "mm2csv.xsl" can be found inside the FreeMind 0.9 (beta) archive
# in the folder accessories, which can be downloaded at http://freemind.sourceforge.net
 
#Full path to the xsl file
XSL_FILE=/opt/mm2csv.xsl
#Full path to the command line converter xmlstarlet
XML_STARLET=/usr/bin/xmlstarlet
 
#conversion (quoted, so file names with blanks are handled)
"$XML_STARLET" tr "$XSL_FILE" "$1" > "$2"

ZIP Files

For zip files this little script can be used. The command line tools for the conversion need to be added for each document type. The known document types get extracted to a temp folder where they are converted to txt and joined to one big text file which can be indexed.

Currently only conversion tools are supported which have the following style: <cmd> <inputfile> <outputfile>
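
Tools that print to stdout (antiword, for example) do not have this style, but a thin wrapper can provide it. The following is a sketch: to_file is a hypothetical helper, and TOOL defaults to the antiword path used in the example config further up.

```shell
#!/bin/bash
# Sketch: give a tool that prints to stdout the <cmd> <inputfile> <outputfile> style.
# to_file is a hypothetical helper; TOOL defaults to antiword as in the example config.
TOOL=${TOOL:-/usr/bin/antiword}

to_file() {
  local infile=$1 outfile=$2
  # run the tool on the input and redirect its stdout into the output file
  "$TOOL" "$infile" > "$outfile"
}
```

Saved as its own script with a final line to_file "$1" "$2", the wrapper can be entered in the CMD array of the zip script below like any other converter.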

zip2txt.sh
#!/bin/bash
# This is a converter script to convert the content of a zip file to a single txt file.
# All files whose extensions are defined in this script get unzipped, converted to text and joined to one single output file
# usage: zip2txt.sh <inputfile> <outputfile>

#adapt this:
#Folder where the zip file is unpacked WARNING: DO NOT USE THIS FOLDER FOR ANYTHING ELSE -> all files in there will be converted!
TMPFOLDER="/tmp/zipconverter"
#File which is used as a temporary storage
#DO NOT PLACE THE TMPFILE INSIDE/BELOW THE TMPFOLDER IF YOU DON'T EXACTLY KNOW WHAT YOU ARE DOING
TMPFILE="/tmp/zipconversion.txt"
#commands needed for this script
UNZIP_CMD="/usr/bin/unzip"
FIND_CMD="/usr/bin/find"

#extend the extension and command arrays for your personal needs
#note: the first parameter of the cmd must be the input, the second is the output filename. e.g. /opt/office2txt.sh <inputfile> <outputfile>
FILEEXT[0]="doc"; CMD[0]="/opt/office2txt.sh"
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"
 
#IO definitions
zipfile=$1     
outputfile=$2  
 
#generate filter patterns from FILEEXT (kept in an array so the shell does not expand the wildcards itself)
filter=()
for ext in "${FILEEXT[@]}"
do
  filter+=("*.$ext")
done
 
#Unzip only content with known extensions into TMPFOLDER, matching the filter case-insensitively ("-C").
# The "-P" option supplies a dummy password so that unzip does not ask on stdin
# if a file is encrypted
mkdir -p "$TMPFOLDER"
"$UNZIP_CMD" -o -qq -C -P "n" "$1" "${filter[@]}" -d "$TMPFOLDER"
 
#put all filenames into an array which are inside the TMPFOLDER.
#Whitespaces in filenames are handled correctly (from http://mywiki.wooledge.org/BashFAQ/020)
unset filenames i
while IFS= read -r -d '' file; do
  filenames[i++]=$file
#  echo "File: ${filenames[i-1]}"
done < <($FIND_CMD $TMPFOLDER -type f -print0)
 
#switch off case sensitivity
shopt -s nocasematch
 
#convert each file to txt according the command set in CMD
for file in "${filenames[@]}"
do
echo "Working on file: $file"
  #get fileextention
  input_filename_w_ext=$(basename "$file")
  input_extension=${input_filename_w_ext##*.}
  #search extension in FILEEXT array (case insensitive)
  # get length of an array
  tLen=${#FILEEXT[@]}
  for (( i=0; i<tLen; i++ ));
  do
    if [[ ${FILEEXT[$i]} = "$input_extension" ]]
    then
      rm -f "$TMPFILE" #make sure it is empty
      #execute conversion cmd
      echo ${CMD[$i]} "$file" "$TMPFILE"
      ${CMD[$i]} "$file" "$TMPFILE"
      #append $TMPFILE to output file $outputfile
      cat "$TMPFILE" >> "$outputfile"
      break
    fi
  done
done
#switch on case sensitivity
shopt -u nocasematch
 
#remove all stuff in the temp folder and the temp file
# ${TMPFOLDER:?} aborts if the variable is empty, so this can never become "rm -rf /*"
rm -rf "${TMPFOLDER:?}"/*
rm -f "$TMPFILE"

WARNING: Because this script joins all content found in the zip file into one huge text file, the indexing process (PHP) will need a lot of memory! It is best to write the output of this conversion script to a logfile and check it for errors on a regular basis. To increase the memory, have a look at the tips at the top of the page. I had to set the memory limit of PHP to 250 MB because the txt file generated by this script was 8.8 MB in size. This can happen very easily if a zip file contains a lot of PDF documents!
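
Writing the output to a logfile can be sketched as follows. The helper name run_logged and the log path are assumptions; the function appends everything a command prints to the log, together with a timestamp and the exit status, and returns that status to the caller.

```shell
#!/bin/bash
# Sketch: run a command, append its output to a log file and keep the exit status.
# run_logged and the log path are hypothetical.
LOGFILE=${LOGFILE:-/tmp/docsearch-index.log}

run_logged() {
  local status
  {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') start: $*"
    "$@"
    status=$?
    echo "=== exit status: $status"
  } >> "$LOGFILE" 2>&1
  return "$status"
}
```

In the cronjob, run_logged php dokuwiki/lib/plugins/docsearch/cron.php would replace the plain call, and a periodic grep -i -E 'error|fatal' on the logfile shows whether a conversion or the indexer ran out of memory.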

Installation in Windows 2003

I run Dokuwiki for our company's intranet on a Windows 2003 server with XAMPP. The following short description explains how I got docsearch to run on this system.

Converters

For me, the following converters worked:

  • Xpdf (provides pdftotext.exe)
  • catdoc (provides catdoc.exe, xls2csv.exe and catppt.exe)

Install them in an appropriate location, e.g. C:\TOOLS, and adjust the converter.php file (replace [PATH TO] with your actual path, e.g. C:\TOOLS\XPDF and C:\TOOLS\CATDOC).

converter.php
pdf    [PATH TO]\pdftotext.exe %in% %out%
doc    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xls    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
ppt    [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
docx    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xlsx    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
pptx    [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
xlt    [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
dot    [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%

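For catdoc, -a -s koi8-u -d koi8-u means:

  • -a: output is in ASCII text format
  • -s koi8-u: the source (input) charset is assumed to be KOI8-U
  • -d koi8-u: the destination (output) charset is KOI8-U

Note that KOI8-U is an 8-bit Cyrillic encoding, not UTF-8. Since the output file should be UTF-8 encoded (see Configuration above), -d utf-8 is probably the better choice unless your setup really needs KOI8-U.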

Cronjob

Instead of a cronjob, set up a scheduled task in Windows to index new files. To do this, go to Start→Programs→Accessories→System Tools→Scheduled Tasks, set up a new task and enter the following into the “run” field: “[PATH TO]\php.exe [PATH TO]\cron.php” (the cron.php file is in a subdirectory of the docsearch plugin).

Example: The “run” field should contain something like this:

"C:\Program Files\php\php.exe" C:\Website\dokuwiki\lib\plugins\docsearch\cron.php

or with XAMPP as a server environment

C:\xampp\php\php.exe C:\xampp\htdocs\dokuwiki\lib\plugins\docsearch\cron.php

The rest of the setup should be straightforward.

Issues

You may get a “file not found” or “path not found” error under cron.php when using some utilities or command line expressions in the converter.php file. This is due to the path slashes not being formatted to DOS/Windows format.

To fix this, insert this code around line 87 in cron.php after $cmd = str_replace('%out%', escapeshellarg($out), $cmd);:

if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
		$cmd = str_replace('/', '\\', $cmd);
}
plugin/docsearch.txt · Last modified: 2024-01-08 17:05 by Aleksandr

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International