Fulltext Index

DokuWiki uses a fulltext index to search the wiki quickly.

FIXME documentation in progress - might be wrong.

The index system is designed with the following three premises in mind:

  1. PHP execution time is limited (usually 30 seconds, less on some hosts)
  2. Memory is limited (we recommend 32MB, but many hosts offer less)
  3. Disk space is cheap

Structure

All parts of the fulltext index are stored in data/index:

  • page.idx – a list of all known pages
  • w<word length>.idx – a list of all known words with a byte length of <word length>
  • i<word length>.idx – word → page assignments
  • pageword.idx – page → word assignments

page.idx

This file lists every page that has ever been indexed, one page per line. The line number (starting from 0) is the page's PID.

Note that pages are not removed from this index when they are deleted. The existence check is done at search time.
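The PID scheme can be illustrated with a short sketch. This is not DokuWiki's PHP code: the function name load_page_ids and the sample page names are invented for illustration, and the only thing taken from the description above is that each line holds one page and the zero-based line number is the PID.

```python
def load_page_ids(lines):
    """Map each page name to its PID (the zero-based line number)."""
    return {name.rstrip("\n"): pid for pid, name in enumerate(lines)}


# Hypothetical content of page.idx, one page per line.
sample = ["start\n", "wiki:syntax\n", "devel:fulltextindex\n"]
pids = load_page_ids(sample)
print(pids["wiki:syntax"])  # 1
```

Because deleted pages keep their line, a PID stays stable for the lifetime of the index; a real lookup would still have to check that the page exists.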

w<word length>.idx

These files list every word that has ever been indexed with a byte length of <word length>. The line number (starting from 0) is the word's WID.
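As a rough sketch of how a word maps to a file and a WID: the helper names index_file_for and find_wid are hypothetical, and the sketch assumes words are UTF-8 encoded when their bytes are counted. Note that byte length and character count differ for non-ASCII words.

```python
def index_file_for(word):
    """Return the w<len>.idx file name for a word.

    Assumption: the length is the number of bytes in the UTF-8 encoding.
    """
    return "w%d.idx" % len(word.encode("utf-8"))


def find_wid(word, lines):
    """Return the WID (zero-based line number) of word, or None if absent."""
    for wid, entry in enumerate(lines):
        if entry.rstrip("\n") == word:
            return wid
    return None


print(index_file_for("wiki"))  # w4.idx
print(index_file_for("über"))  # w5.idx, because "ü" takes two bytes in UTF-8

# Hypothetical content of w4.idx
print(find_wid("wiki", ["page\n", "wiki\n", "word\n"]))  # 1
```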

i<word length>.idx

These files contain the real index by mapping words to the pages they occur on. The file also contains the frequency of the word on the given page, which is used for search ranking.

The line numbers correspond to the ones in w<word length>.idx, so the w and i indexes should always have the same number of lines.

Each line contains PID*<frequency> pairs separated by colons.

Imagine line 0 of i5.idx containing this entry:

0*2:55*5:23*1

This would mean that word 0 (→ line 0 in w5.idx) occurs 2 times on page 0 (→ line 0 in page.idx), 5 times on page 55 (→ line 55 in page.idx) and once on page 23 (→ line 23 in page.idx).
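The line format can be decoded with a few lines of code. The following is a minimal sketch, not the parser from inc/indexer.php; the function name parse_postings is made up, and treating an empty line as "no pages" is an assumption.

```python
def parse_postings(line):
    """Parse one i<len>.idx line into a list of (pid, frequency) tuples."""
    line = line.strip()
    if not line:
        # Assumption: an empty line means the word occurs on no page.
        return []
    postings = []
    for pair in line.split(":"):
        if not pair:
            continue  # tolerate stray colons
        pid, freq = pair.split("*")
        postings.append((int(pid), int(freq)))
    return postings


print(parse_postings("0*2:55*5:23*1"))  # [(0, 2), (55, 5), (23, 1)]
```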

pageword.idx

This index is used to remove words from the index that are no longer on a changed or deleted page.

The line number (starting from 0) is the PID from page.idx. Each line contains <wordlength>*WID pairs separated by colons.

Because only one line of this index is read and written during an index update, it is quite efficient.
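The cleanup step described here might look roughly like the sketch below. It keeps the index in memory as a dict mapping word length to the lines of the matching i<word length>.idx file, which is a simplification for illustration; all function names are hypothetical, and how DokuWiki actually rewrites the files is not covered on this page.

```python
def parse_pageword(line):
    """Parse a pageword.idx line into (word_length, wid) tuples."""
    return [tuple(int(n) for n in pair.split("*"))
            for pair in line.strip().split(":") if pair]


def drop_page(posting_line, pid):
    """Remove the PID*<frequency> pair for pid from one i<len>.idx line."""
    prefix = "%d*" % pid
    kept = [p for p in posting_line.strip().split(":")
            if p and not p.startswith(prefix)]
    return ":".join(kept)


def remove_page_words(pid, pageword_line, index):
    """Drop pid from every word line that the pageword entry points to."""
    for length, wid in parse_pageword(pageword_line):
        index[length][wid] = drop_page(index[length][wid], pid)


# Hypothetical in-memory copy of i5.idx and i4.idx.
index = {5: ["0*2:55*5:23*1"], 4: ["7*1", "23*4:0*1"]}
# Hypothetical pageword.idx line for PID 23: word 0 of length 5, word 1 of length 4.
remove_page_words(23, "5*0:4*1", index)
print(index)  # {5: ['0*2:55*5'], 4: ['7*1', '0*1']}
```

Only the word lines named in the page's pageword entry are touched, so no full scan of the i files is needed.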

Indexing

inc/indexer.php contains all functions related to building the fulltext index.

FIXME describe the process

Searching

inc/fulltext.php contains all functions related to searching the fulltext index.

FIXME describe the process

devel/fulltextindex.txt · Last modified: 2013-08-14 15:39 by 82.7.195.164