
DokuWiki UTF8 conversion

DokuWiki has stored its data in UTF-8 encoding since release 2005-02-06. This allows you to use all kinds of languages in the same wiki installation. It also means that if you upgrade from an older version, you need to re-encode your data files.

If you are installing DokuWiki for the first time, you don't need to do anything - DokuWiki will work out of the box.

You can either recode all existing pages yourself, e.g. using iconv or recode, or use the “UTF-8 conversion helper” described below.

If you do the conversion yourself, please note that DokuWiki stores filenames urlencoded so you may have to rename your files, too.
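
As a rough illustration of the renaming step, the sketch below decodes an old urlencoded filename, converts the bytes with iconv, and percent-encodes the result again. The function names urldecode, urlencode and reencode_name are hypothetical helpers, not part of DokuWiki, and the set of characters left unencoded (letters, digits, '-', '_' and '.') is an assumption. Check a few names against what your DokuWiki version actually writes before renaming anything.

```bash
#!/bin/bash
# Hypothetical helpers, not part of DokuWiki.
# Assumption: names contain only percent escapes and plain characters
# (no '+' or backslashes).
FROM=ISO-8859-1
TO=UTF-8

# Percent-decode a name: turn every %XX into the raw byte.
urldecode() {
    printf '%b' "${1//%/\\x}"
}

# Percent-encode every byte except ASCII letters, digits, '-', '_' and '.'.
urlencode() {
    local hex byte out=""
    # Dump the string as a continuous run of two-digit hex bytes.
    hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n')
    while [ -n "$hex" ]; do
        byte=${hex:0:2}
        hex=${hex:2}
        case $byte in
            2d|2e|5f|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
                # Safe ASCII byte: emit the character itself.
                out+=$(printf "\\x$byte") ;;
            *)
                # Anything else: emit %XX with uppercase hex digits.
                out+="%$(printf '%s' "$byte" | tr 'a-f' 'A-F')" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Old urlencoded name in, new urlencoded name in the target encoding out.
reencode_name() {
    urlencode "$(urldecode "$1" | iconv -f "$FROM" -t "$TO")"
}
```

With ISO-8859-1 as the source encoding, reencode_name 'caf%E9.txt' prints caf%C3%A9.txt, while a pure ASCII name such as start.txt comes back unchanged. The result could then be fed to mv to rename the file.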

UTF-8 conversion helper

:!: This script has not been updated for a long time and is not compatible with newer DokuWiki releases, so it will no longer work out of the box. Have a look at the bash script below for an alternative way to upgrade old data files.

The simplest way to upgrade your data files to UTF-8 is to use the “dokuwiki-convert” script: dokuwiki-convert-latest.tgz

The script will walk through your data directory and reencode all the files for you.

Usage

  1. Recommended: Deny write access to your wiki for all users, using the ACL feature or a .htaccess file
  2. Create a backup of all your files :!:
  3. Upgrade your DokuWiki to the newest version as usual
  4. Install dokuwiki-convert somewhere on your webserver 1)
  5. Edit the dokuwiki-convert/index.php file
    • You need to set the full filesystem path to your DokuWiki at the very top, e.g. /var/www/dokuwiki/
  6. Point your web browser to the dokuwiki-convert script
  7. Choose your current file encoding
  8. Hit the Do the conversion button
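
The backup in step 2 can be made in many ways. One possible sketch, assuming tar with gzip support is available, is shown below. The function name backup_dir and both paths in the usage note are illustrative assumptions, not something DokuWiki prescribes.

```bash
#!/bin/bash
# Hypothetical helper: archive a directory into a gzip-compressed tarball.
#   $1 = directory to back up (for example the DokuWiki data directory)
#   $2 = path of the archive to create
backup_dir() {
    # -C switches to the parent directory so the archive holds relative paths.
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

It could be called as backup_dir /var/www/dokuwiki/data "$HOME/dokuwiki-data-backup.tar.gz", with the paths adjusted to your own installation.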

Additional Notes

  • The script does not convert your old revisions.
    • You need to delete them, or convert them yourself.
  • The script does not convert your changes.log.
    • You need to delete it, or convert it yourself.
  • The script may time out when running in safe mode.
    • Just rerun it until it says it has finished.
    • If it does not work for you, you need to do the conversion yourself.
  • For English wikis the script will skip a lot of files.
    • US-ASCII is a subset of UTF-8, so there is no need to convert these files.
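
The same idea can be used when converting by hand: files that are already valid UTF-8 (which includes pure US-ASCII files) can be left alone. The following is a minimal sketch, assuming iconv is installed. The function names is_valid_utf8 and list_files_to_convert are hypothetical. A file in a legacy encoding can occasionally happen to be valid UTF-8 as well, so treat the result as a hint rather than a guarantee.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# Pure US-ASCII files pass too, because US-ASCII is a subset of UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Print every *.txt file below directory $1 that is not valid UTF-8,
# i.e. the files that still need to be converted.
list_files_to_convert() {
    find "$1" -type f -name "*.txt" | while IFS= read -r fn; do
        is_valid_utf8 "$fn" || printf '%s\n' "$fn"
    done
}
```

Running list_files_to_convert pages/ from the data directory would list only the page files that still contain non-UTF-8 bytes.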

Sample Bash script for conversion with iconv

The following code might be helpful if you do the conversion yourself with iconv. Besides converting the data dir, this script also converts changes.log and the old revisions. Run this script from the data directory.
#!/bin/bash
FROM=latin1
TO=utf8
ICONV="iconv -f $FROM -t $TO"
 
# Convert changes.log
cp changes.log changes.log.bak
$ICONV < changes.log.bak > changes.log
rm changes.log.bak
 
# Convert pages/ subdir
find pages/ -type f -name "*.txt" | while IFS= read -r fn; do
	# Quote the names so files with spaces are handled correctly
	cp "$fn" "$fn.bak"
	$ICONV < "$fn.bak" > "$fn"
	rm "$fn.bak"
done
 
# Convert attic/ subdir (where the script assumes gzip compression)
find attic/ -type f -name "*.txt.gz" | while IFS= read -r fn; do
	cp "$fn" "$fn.bak"
	# Decompress, convert, and recompress in one pipeline
	{ gzip -cd | $ICONV | gzip -c; } < "$fn.bak" > "$fn"
	rm "$fn.bak"
done
To use this script on Windows XP Pro (or Windows 2000 Pro) with Cygwin, for ISO8859-15 (pt_PT), I had to change the first lines of the script to:
#!/bin/bash
FROM=ISO8859-15
TO=UTF-8
Everything else remains the same, and the result of the execution was successful. I've been able to convert two entire DokuWiki-enabled sites in less than 5 minutes. I found out about the correct encodings after issuing the following command on a Cygwin-Bash Prompt:
iconv -l
I have modified the script to keep the timestamps for the files in data/ -- Andrea 2005-11-04 11:57
# Convert data/ subdir
find data/ -type f -name "*.txt" | while IFS= read -r fn; do
        cp -p "$fn" "$fn.bak"
        $ICONV < "$fn.bak" > "$fn"
        # Restore the original modification time from the backup
        touch -r "$fn.bak" "$fn"
        rm "$fn.bak"
done
I have modified it again so that files left unchanged by the conversion keep their timestamps, which makes it easier to use with CVS. This version looks for *.java and *.jsp files in the current directory and its subdirectories -- Flavio 2008-01-29
#!/bin/bash
FROM=cp1252
TO=utf8
ICONV="iconv -f $FROM -t $TO"
 
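# The parentheses make -type f apply to both name patterns
find . -type f \( -name "*.java" -o -name "*.jsp" \) | while IFS= read -r fn; do
	# -p keeps the original timestamp on the backup copy
	cp -p "$fn" "$fn.bak"
	$ICONV < "$fn.bak" > "$fn"
	if cmp -s "$fn" "$fn.bak"; then
		# Content unchanged: restore the original timestamp
		touch -r "$fn.bak" "$fn"
	else
		echo "MODIFIED - $fn"
	fi
	rm "$fn.bak"
done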

Manual conversion with EditPad Lite

As I couldn't get the above scripts working, I converted my pages manually, using the ANSI to UTF-8 converter in EditPad Lite.

1)
You can put it in an additional directory inside your DokuWiki directory if you like.
tips/utf8update.txt · Last modified: 2010-06-12 13:50 by andi
