

This is an old revision of the document!


Asian Search Plugin

Compatible with DokuWiki: 2009-02-14

Plugin: Manipulates a search query for a better search experience in Asian languages

Last updated on: 2009-09-06
Provides: Action
Conflicts with: pageredirect

This extension has not been updated in over 2 years. It may no longer be maintained or supported and may have compatibility issues.

Tagged with asian, cjk, highlight, search

This plugin provides a better search experience in Asian languages by manipulating a search query. No re-indexing is required.

Download and Installation

Download and install the plugin with the Plugin Manager using the following URL. Refer to Plugins for how to install plugins manually.

History

  • 2009-09-06 – Initial release

Overview

DokuWiki's full-text search now supports Asian languages, but searching Asian-language text still has two problems:

  • Search result highlights are a bit too fragmented
  • An ideographic space (U+3000) in a search query is treated not as a search term separator, but as a character itself

This plugin solves both problems by rewriting the search query; it never modifies your DokuWiki's search index files.

Examples

Let's assume that your DokuWiki has a page whose text is:

京都から東海道新幹線で東に向かうと、東京に着いた。

Search Result Highlights

Search Query: [ 東京 ]

Setup | Search Result (highlights marked with «») | Impressions
Plain DokuWiki | «京»都から«東»海道新幹線で«東»に向かうと、«東京»に着いた。 | Too fragmented. Noisy.
With this plugin | 京都から東海道新幹線で東に向かうと、«東京»に着いた。 | Good.

Ideographic Spaces

Search Query: [ 新幹線　東京 ]

Setup | Search Result (highlights marked with «») | Impressions
Plain DokuWiki | No hits! | Why?
With this plugin | 京都から東海道«新幹線»で東に向かうと、«東京»に着いた。 | As I expected.

Note that the space between "新幹線" and "東京" is not a normal space but an ideographic space (U+3000). Plain DokuWiki parses the search query [ 新幹線　東京 ] as the single term "新幹線　東京", not as the two separate terms "新幹線" and "東京".
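
The two kinds of space look alike on screen, so a short Python sketch may help. It is not DokuWiki's tokenizer (DokuWiki is written in PHP). It only assumes, for illustration, that search terms are separated on the ASCII space, and the variable names are made up.

```python
# Illustration only: assumes terms are separated on the ASCII space (U+0020).
IDEOGRAPHIC_SPACE = "\u3000"

# The query from the example, written with an explicit ideographic space.
query = "新幹線" + IDEOGRAPHIC_SPACE + "東京"

# U+3000 is a different character from U+0020, so nothing is split here.
terms_before = query.split(" ")
print(len(terms_before))  # 1: the whole query is one term

# Replacing the ideographic space with a normal space separates the terms.
normalized = query.replace(IDEOGRAPHIC_SPACE, " ")
terms_after = normalized.split(" ")
print(len(terms_after))  # 2: "新幹線" and "東京"
```

Note that Python's argument-less `str.split()` treats U+3000 as whitespace, which is why the sketch splits on `" "` explicitly.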

How It Works

This plugin manipulates a search query by using the following steps:

  1. Sets existing “phrases” aside so that nothing inside them is changed
  2. Replaces ideographic spaces with normal spaces
  3. Turns each run of consecutive Asian characters into a phrase

Below is an example of a complicated query.

Original Query: [ dokuwiki "asiansearch plugin"　プラグイン　插件　플러그인 "検 索" ]
Manipulated Query: [ dokuwiki "asiansearch plugin" "プラグイン" "插件" "플러그인" "検 索" ]

You can see that runs of Asian characters are quoted, ideographic spaces are replaced with normal spaces, and nothing is changed within existing phrases.
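
The three steps can be sketched in Python as follows. This is not the plugin's code (the plugin is a PHP action plugin). The function name `manipulate_query`, the placeholder scheme, and the character ranges treated as "Asian" (kana, CJK unified ideographs, Hangul syllables) are all assumptions chosen so that the sketch covers the example query.

```python
import re

# Assumption: a simplified set of "Asian" ranges (kana, CJK ideographs, Hangul).
ASIAN_RUN_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")
PHRASE_RE = re.compile(r'"[^"]*"')
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")


def manipulate_query(query: str) -> str:
    """Hypothetical re-implementation of the three steps described above."""
    phrases = []

    def stash(match):
        # Step 1: set an existing quoted phrase aside behind a placeholder.
        phrases.append(match.group(0))
        return f"\x00{len(phrases) - 1}\x00"

    work = PHRASE_RE.sub(stash, query)

    # Step 2: replace ideographic spaces (U+3000) with normal spaces.
    work = work.replace("\u3000", " ")

    # Step 3: quote each run of consecutive Asian characters.
    work = ASIAN_RUN_RE.sub(lambda m: f'"{m.group(0)}"', work)

    # Put the untouched phrases back in place of their placeholders.
    return PLACEHOLDER_RE.sub(lambda m: phrases[int(m.group(1))], work)


# The example query, with the ideographic spaces written as \u3000.
original = 'dokuwiki "asiansearch plugin"\u3000プラグイン\u3000插件\u3000플러그인 "検 索"'
print(manipulate_query(original))
```

Setting the phrases aside first is what keeps steps 2 and 3 from touching text the user quoted deliberately.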

By default, DokuWiki treats each Asian character as a “word” and, additionally, each run of consecutive Asian characters as a “phrase”. It highlights both “words” and “phrases” in search results. Preprocessing a search query in the way described above reduces the number of “words”, which results in cleaner highlights in your search results.
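
To see why fewer “words” means cleaner highlights, here is a toy Python model of highlighting applied to the 東京 example from the Search Result Highlights section. It is not DokuWiki's highlighter. The `highlight` function and the «» markers are assumptions made for illustration.

```python
import re


def highlight(text, terms):
    """Toy highlighter: wrap every occurrence of any term in «»."""
    # Try longer terms first so a phrase wins over its single characters.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    return re.sub(pattern, lambda m: f"«{m.group(0)}»", text)


text = "京都から東海道新幹線で東に向かうと、東京に着いた。"

# Plain DokuWiki: the words "東" and "京" plus the phrase "東京".
plain = highlight(text, ["東", "京", "東京"])

# With the plugin: only the phrase "東京" remains.
with_plugin = highlight(text, ["東京"])

print(plain)
print(with_plugin)
```

With the single-character words present, every stray 東 and 京 in the sentence is marked. With only the phrase, the one real match is.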

By checking the returned values of ft_queryParser, you can see how DokuWiki parsed these queries.

Original Query:

'phrases':
  - 'asiansearch plugin'
  - '検 索'
  - '　プラグイン　插件　플러그인'
'words':
  - 'dokuwiki'
  - 'プ'
  - 'ラ'
  - 'グ'
  - 'イ'
  - 'ン'
  - '插'
  - '件'
  - '플'
  - '러'
  - '그'
  - '인'

Manipulated Query:

'phrases':
  - 'asiansearch plugin'
  - 'プラグイン'
  - '插件'
  - '플러그인'
  - '検 索'
'words':
  - 'dokuwiki'

Discussion

The problem this plugin is trying to solve looks like a bug to me. DokuWiki should handle Asian queries correctly out of the box. Could you please open a bug report to get this solved in core? — Andreas Gohr 2009/09/06 23:05

plugin/asiansearch.1252293264.txt.gz · Last modified: 2009-09-07 05:14 by kazmiya