How to convert docs to DokuWiki
I was googling for conversion tools and, happily, came across the ones used here.
Main goal: magic conversion in a bureaucratic environment
*.doc --> *.html --> *.txt (wiki syntax)
Here is the overall scheme used to do this:
Step 0 | Preparing the environment
Dependencies :
- Linux !
- OpenOffice.org: http://www.openoffice.org/
- Java and JODConverter by: http://www.artofsolving.com
- Perl and WikiConverter module: http://search.cpan.org/src/DIBERRI/HTML-WikiConverter-0.61/
- Apache, PHP, and an out-of-the-box DokuWiki (or MediaWiki)
- Optional, the extension FCKW for DokuWiki: fckw
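Before going further, it can help to check that the command-line tools from the list above are really on the PATH. The following is only a minimal sketch: the function name missing_tools is made up for this page, and the tool names (soffice, java, perl, html2wiki) are simply the commands used later on.

```shell
#!/bin/bash
# Sketch: print the name of every requested command that is NOT on the PATH.
# missing_tools is a hypothetical helper, not part of any of the tools above.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v prints nothing and fails when the tool cannot be found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (prints nothing when everything is installed):
#   missing_tools soffice java perl html2wiki
```

An empty output means all four commands were found; otherwise install whatever is listed before running the scripts.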
Code needed
Three files:
- The main bash script: oocwiki.sh
- The cleaning bash script: cleanfolder.sh
- The renaming / automatic conversion loop Perl script: oocwiki.pl
Copy the code of each file (listed in full further down this page) into a folder on your computer.
Folders:
Create a folder containing your bunch of MS Word files:
MS Word environment:
ENWOLRD=/home/massou/Documents/oldies/
Then set, in the bash script, the parameters for the other folders and files we need:
Temp folder:
TMPOOCWIKI=/tmp/oocwiki/
JODConverter jar:
JODCON=/home/massou/Documents/perl/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar
DokuWiki transfer folders:
OUTWIKI=/srv/www/htdocs/dokuwiki/data/pages/outdoc/
OUTMEDIA=/srv/www/htdocs/dokuwiki/data/media/outdoc/
Then use this bash script:
oocwiki.sh
#!/bin/bash
# script oocwiki.sh
#
# usage (as root): bash oocwiki.sh
# This script converts a folder of MS Word files into DokuWiki pages.
# Change the values of the variables to make the script work for you:
ENWOLRD=/home/massou/Documents/oldies/
TMPOOCWIKI=/tmp/oocwiki/
JODCON=/home/massou/Documents/perl/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar
OUTWIKI=/srv/www/htdocs/dokuwiki/data/pages/outdoc/
OUTMEDIA=/srv/www/htdocs/dokuwiki/data/media/outdoc/

if [ "$(whoami)" != 'root' ]; then
    echo "Must be root to run $0"
    exit 1
fi

## Are the folders ok? Create the missing ones.
parameters=("$ENWOLRD" "$TMPOOCWIKI" "$OUTWIKI" "$OUTMEDIA")
for i in "${parameters[@]}"; do
    if [ ! -e "$i" ]; then
        echo "$i does not exist"
        mkdir -p "$i"
        echo "$i created"
    elif [ -f "$i" ]; then
        echo "$i is a file"
    elif [ -d "$i" ]; then
        echo "$i seems ready"
    fi
done

if [ ! -e "$JODCON" ]; then
    echo "$JODCON does not exist"
    exit 1
elif [ -f "$JODCON" ]; then
    echo "$JODCON is ready"
fi

# Start OpenOffice.org as a service if it is not already running
if ! pgrep soffice > /dev/null; then
    echo "soffice does not seem to be running, starting it..."
    soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard &
    sleep 5   # give soffice a moment to start listening
fi

### Cleaning and copy: empty the working and output folders
parameters=("$TMPOOCWIKI" "$OUTWIKI" "$OUTMEDIA")
for i in "${parameters[@]}"; do
    if [ -e "$i" ]; then
        echo "$i exists, emptying it"
        rm -R "$i"
        mkdir -p "$i"
        echo "$i recreated"
    fi
done
cp -R "$ENWOLRD"/* "$TMPOOCWIKI"

################### Step 1 Some cleaning ##################
bash ./cleanfolder.sh "$TMPOOCWIKI"

######################### Step 2-3 Time of perl #################
perl oocwiki.pl "$TMPOOCWIKI" "$JODCON"

######################### Step 4 Copy of the files #################
cp -R "$TMPOOCWIKI"/* "$OUTWIKI"
cp -R "$TMPOOCWIKI"/* "$OUTMEDIA"

########### Step 5 time for ACL #########
parameters=("$OUTWIKI" "$OUTMEDIA")
for i in "${parameters[@]}"; do
    chown -R wwwrun "$i"
    chgrp -R www "$i"
    chmod -R 775 "$i"
done
Step 1 | Cleaning the MS Word environment
/////*.doc
A Bash or Perl script renames folders, subfolders and files from Windows-style names to simpler Unix-like names (lower case, no blanks).
cleanfolder.sh
#!/bin/bash
# file cleanfolder.sh
# Convert filenames to lowercase
# and replace characters recursively
#####################################
if [ -z "$1" ]; then
    echo "Give target directory"
    exit 1
fi

# -depth lists the content of a folder before the folder itself,
# so children are renamed before their parents
find "$1" -depth -name '*' | while IFS= read -r file; do
    directory=$(dirname "$file")
    oldfilename=$(basename "$file")
    newfilename=$(echo "$oldfilename" | tr 'A-Z' 'a-z' | tr ' ' '_' | sed 's/_-_/-/g')
    if [ "$oldfilename" != "$newfilename" ]; then
        # -n (never overwrite) instead of -i: inside this loop stdin is the
        # list of files, so an interactive prompt could not be answered
        mv -n "$directory/$oldfilename" "$directory/$newfilename"
        echo "$directory/$oldfilename ---> $directory/$newfilename"
    fi
done
exit 0
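To see what the renaming rule of cleanfolder.sh does to a single name, the same tr/sed pipeline can be wrapped in a small function and tried on a sample. This is just an illustration: the function name normalize_name and the sample file names are invented for the example.

```shell
#!/bin/bash
# Sketch: the renaming rule of cleanfolder.sh applied to one name.
# normalize_name is a hypothetical helper used only for this illustration.
normalize_name() {
    # lower case, blanks to underscores, then "_-_" collapsed to "-"
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr ' ' '_' | sed 's/_-_/-/g'
}

normalize_name 'My Report - Final.DOC'    # prints: my_report-final.doc
normalize_name 'Annual Report 2009.doc'   # prints: annual_report_2009.doc
```

Accented characters are not touched at this stage; the Perl script of step 2 takes care of them.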
Step 2:
lower_case/without_blank_space.doc --> soffice as a service + JODConverter --> *.html
oocwiki.pl
#!/usr/bin/perl
# script oocwiki.pl
# usage: perl oocwiki.pl <working folder> <path to jodconverter-cli jar>
# Save this file as UTF-8: the tr/// list below contains accented characters.
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);
use File::Basename;
use File::Find;

my $time = localtime;
print "The time is now $time\n";

my ($chemin, $jod) = @ARGV;
die "Usage: $0 <folder> <jodconverter-cli.jar>\n"
    unless defined $chemin && defined $jod;
print "$chemin\n";
print "$jod\n";

# Collect the .doc files first, so that nothing is renamed
# while find() is still walking the tree
my @docs;
find(sub { push @docs, $File::Find::name if -f && /\.doc$/i }, $chemin);

foreach my $fullname (@docs) {
    # Step 1 renaming, again: lower case, no blanks, no accents
    # (file names are assumed to be UTF-8 encoded)
    my $base2 = lc decode('UTF-8', basename($fullname));
    $base2 =~ tr/ /_/;
    $base2 =~ tr/ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ/aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn/;
    my $newname = dirname($fullname) . '/' . encode('UTF-8', $base2);

    if ($fullname ne $newname) {
        print "$fullname ---> $newname\n";
        unless (rename($fullname, $newname)) {
            warn "Couldn't rename $fullname to $newname: $!\n";
            next;
        }
    }

    # Prepare the output names for the conversion
    (my $newname2 = $newname) =~ s/\.doc$/.html/;
    (my $newname3 = $newname) =~ s/\.doc$/.txt/;

    # Step 2: .doc -> .html with JODConverter (soffice must be listening)
    if (system('java', '-jar', $jod, $newname, $newname2) != 0) {
        warn "JODConverter failed on $newname\n";
        next;
    }

    # Step 3: .html -> .txt (DokuWiki syntax) with html2wiki
    my ($in, $out);
    unless (open($in, '-|', 'html2wiki', '--dialect', 'DokuWiki', $newname2)) {
        warn "Couldn't run html2wiki: $!\n";
        next;
    }
    unless (open($out, '>', $newname3)) {
        warn "Couldn't write $newname3: $!\n";
        close($in);
        next;
    }
    print {$out} $_ while <$in>;
    close($out);
    close($in);
}
Step 3:
*.html --> HTML::WikiConverter (html2wiki) --> *.txt
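Steps 2 and 3 for one single file can also be tried by hand from the shell, which is handy for debugging before launching the whole loop. The sketch below assumes that soffice is already listening on port 8100 and that html2wiki is installed; the function names html_name, txt_name and convert_one are hypothetical and only serve the example.

```shell
#!/bin/bash
# Sketch: steps 2 and 3 for one file, outside the Perl loop.
# html_name, txt_name and convert_one are hypothetical helper names.

# Derive the output names from the .doc path (pure string work)
html_name() { printf '%s\n' "${1%.doc}.html"; }
txt_name()  { printf '%s\n' "${1%.doc}.txt"; }

# $1 = path of the .doc file, $2 = path of the jodconverter-cli jar
convert_one() {
    doc=$1
    jar=$2
    # Step 2: .doc -> .html, then step 3: .html -> .txt (DokuWiki syntax)
    java -jar "$jar" "$doc" "$(html_name "$doc")" &&
        html2wiki --dialect DokuWiki "$(html_name "$doc")" > "$(txt_name "$doc")"
}

# Example call:
#   convert_one /tmp/oocwiki/a.doc /path/to/jodconverter-cli-2.2.1.jar
```

If the java command fails, html2wiki is not run at all, so a missing or stale .html file is easy to spot.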
Step 4:
Finally, we simply copy the files to the media and pages folders; that is enough for now.
A further Perl script could rewrite the media URLs so that they point to the right media location, and dispatch the media and txt files to the right places on the server.
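As a rough idea of what such a dispatch could look like, here is a shell sketch that sends the .txt files to the pages folder and the remaining generated files (images and so on) to the media folder, leaving the intermediate .doc and .html files behind. It is an assumption about one possible approach, not what oocwiki.sh does today: the function name dispatch_outputs is made up, the folder tree is flattened for simplicity, and the media links inside the pages are not rewritten.

```shell
#!/bin/bash
# Sketch: dispatch converted files instead of copying everything twice.
# dispatch_outputs is a hypothetical helper; subfolders are flattened here.
# $1 = working folder, $2 = pages folder, $3 = media folder
dispatch_outputs() {
    src=$1
    pages=$2
    media=$3
    mkdir -p "$pages" "$media"
    # Wiki pages: the generated .txt files
    find "$src" -type f -name '*.txt' -exec cp {} "$pages" \;
    # Media: everything that is neither wiki text nor an intermediate file
    find "$src" -type f ! -name '*.txt' ! -name '*.doc' ! -name '*.html' \
        -exec cp {} "$media" \;
}

# Example call with the variables of oocwiki.sh:
#   dispatch_outputs "$TMPOOCWIKI" "$OUTWIKI" "$OUTMEDIA"
```

Because the tree is flattened, two files with the same name in different subfolders would overwrite each other; keeping the subfolders would be needed for real use.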
Step 5:
Fix the owner and the permissions of the copied files (the chown, chgrp and chmod calls at the end of oocwiki.sh).
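oocwiki.sh applies chmod -R 775 to everything, which also marks plain text pages as executable. A slightly stricter variant, given here only as a sketch, sets 775 on folders and 664 on files. The function name fix_modes is hypothetical, and the owner and group still have to be set separately with chown and chgrp as in the main script.

```shell
#!/bin/bash
# Sketch: folders get 775, files get 664 (no execute bit on pages and media).
# fix_modes is a hypothetical helper; it does not change owner or group.
fix_modes() {
    find "$1" -type d -exec chmod 775 {} +
    find "$1" -type f -exec chmod 664 {} +
}

# Example call with the variables of oocwiki.sh:
#   fix_modes "$OUTWIKI"
#   fix_modes "$OUTMEDIA"
```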
Command lines in use
First, you need OpenOffice.org on a Linux box.
Open a terminal and run:
soffice -headless -accept="socket,port=8100;urp;"
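soffice needs a few seconds before it accepts connections, so it can be useful to wait until port 8100 answers before calling JODConverter. The sketch below relies on the /dev/tcp redirection of bash and on the timeout command from coreutils; the function names port_open and wait_for_soffice are invented for this example, and the default of 10 tries is an arbitrary choice.

```shell
#!/bin/bash
# Sketch: wait until something accepts TCP connections on port 8100.
# port_open and wait_for_soffice are hypothetical helper names.

# $1 = host, $2 = port; succeeds when a TCP connection can be opened
port_open() {
    # bash feature: redirecting to /dev/tcp/HOST/PORT opens a TCP connection
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# $1 = number of tries (one per second), default 10
wait_for_soffice() {
    tries=${1:-10}
    while [ "$tries" -gt 0 ]; do
        port_open 127.0.0.1 8100 && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Example call:
#   wait_for_soffice 15 || echo "soffice is not listening on port 8100"
```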
http://www.artofsolving.com/node/10
(Don't forget to use the cli jar.)
java -jar jodconverter-cli-2.2.1.jar A.doc A.pdf
java -jar jodconverter-cli-2.2.1.jar A.doc A.html
http://search.cpan.org/src/DIBERRI/HTML-WikiConverter-0.61/README
massou@linux-hj6y:~/Documents/momas/jodconverter-2.2.1/lib> html2wiki --dialect DokuWiki A.html > output.mw