Fulltext Index

For quickly searching the wiki, a fulltext index is used.

FIXME documentation in progress - might be wrong.

The index system is designed with the following three premises in mind:

  1. PHP execution time is limited (usually 30 seconds, less on some hosts)
  2. Memory is limited (we recommend 32 MB, but many hosts provide less)
  3. Disk space is cheap

Structure

All parts of the fulltext index are stored in data/index:

  • page.idx – a list of all known pages
  • w<word length>.idx – a list of all known words with a byte length of <word length>
  • i<word length>.idx – word → page assignments
  • pageword.idx – page → word assignments

page.idx

This file lists every page that has ever been indexed, one page per line. The line number (starting from 0) is the page ID (PID).

Note that pages are not removed from this index when they are deleted. The existence check is done at search time.
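
The PID lookup and the search-time existence check described above can be sketched as follows. This is a minimal Python illustration, not the code from inc/indexer.php: the helper names pid_of and live_pages, the in-memory list of lines, and the exists callback are all assumptions made for the example.

```python
def pid_of(page_idx_lines, page):
    """Return the PID (0-based line number) of page, or None if never indexed."""
    for pid, line in enumerate(page_idx_lines):
        if line == page:
            return pid
    return None


def live_pages(page_idx_lines, exists):
    """Return (PID, page) pairs for pages that still exist.

    `exists` is a hypothetical callback standing in for the real
    existence check that happens at search time.
    """
    return [(pid, page) for pid, page in enumerate(page_idx_lines) if exists(page)]


# Hypothetical content of page.idx, one page per line.
page_idx = ["start", "wiki:syntax", "old:deleted"]

print(pid_of(page_idx, "wiki:syntax"))
print(live_pages(page_idx, lambda page: page != "old:deleted"))
```

Because deleted pages keep their line, PIDs stay stable and the other index files never need renumbering.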

w<word length>.idx

These files list every word that has ever been indexed, grouped by byte length: each file holds the words with a byte length of <word length>. The line number (starting from 0) is the word ID (WID).
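
As a rough sketch of how a word might be located, the following Python snippet picks the file from the word's byte length and then uses the line number as the WID. The function name word_location, the dictionary of in-memory files, and the use of UTF-8 for measuring byte length are assumptions for illustration only.

```python
def word_location(word, index_files):
    """Return (byte_length, WID) for word, or None if it is not indexed.

    `index_files` is a hypothetical mapping of file name to list of lines.
    Byte length is measured in UTF-8 here, which is an assumption.
    """
    length = len(word.encode("utf-8"))
    words = index_files.get(f"w{length}.idx", [])
    try:
        return length, words.index(word)
    except ValueError:
        return None


# Hypothetical index content. "über" has 4 characters but 5 bytes in UTF-8,
# so it would be stored in w5.idx rather than w4.idx.
files = {
    "w4.idx": ["wiki", "page"],
    "w5.idx": ["index", "über"],
}

print(word_location("page", files))
print(word_location("über", files))
print(word_location("missing", files))
```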

i<word length>.idx

These files contain the real index by mapping words to the pages they occur on. The file also contains the frequency of the word on the given page, which is used for search ranking.

The line numbers correspond to the ones in w<word length>.idx, so the w and i indexes should always have the same number of lines.

Each line contains PID*<frequency> pairs separated by colons.

Imagine line 0 of i5.idx containing this entry:

0*2:55*5:23*1

This would mean that word 0 (→ line 0 in w5.idx) occurs 2 times on page 0 (→ line 0 in page.idx), 5 times on page 55 (→ line 55 in page.idx) and once on page 23 (→ line 23 in page.idx).
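
The line format above can be decoded with a few lines of Python. This is only a sketch of the format as described here. The function name parse_index_line is hypothetical, and sorting by raw frequency is a simplification rather than the actual ranking used by the search.

```python
def parse_index_line(line):
    """Parse 'PID*freq:PID*freq:...' into a {PID: frequency} dict."""
    result = {}
    for pair in line.strip().split(":"):
        if not pair:  # tolerate empty lines and stray colons
            continue
        pid, freq = pair.split("*")
        result[int(pid)] = int(freq)
    return result


hits = parse_index_line("0*2:55*5:23*1")
print(hits)

# Pages ordered by how often the word occurs, most frequent first.
ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```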

pageword.idx

This index is used to remove words from the index that are no longer on a changed or deleted page.

The line number (starting from 0) is the PID from page.idx. Each line contains <word length>*WID pairs separated by colons.

Because only one line of this index is read and written during an index update, the index is quite efficient.
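
To illustrate why this reverse mapping helps, the sketch below uses one pageword.idx line to find exactly which i<word length>.idx lines mention a page and removes the page from them. It is a Python illustration under stated assumptions, not the actual update routine: the helper names, the in-memory dictionary of files, and the choice to leave an emptied line blank are all hypothetical.

```python
def parse_pageword_line(line):
    """Parse '<word length>*WID:...' into a list of (length, WID) tuples."""
    pairs = []
    for pair in line.strip().split(":"):
        if pair:
            length, wid = pair.split("*")
            pairs.append((int(length), int(wid)))
    return pairs


def remove_page(pid, pageword_line, i_files):
    """Drop pid from every i<length>.idx line named in its pageword entry.

    `i_files` is a hypothetical mapping of file name to list of lines.
    """
    for length, wid in parse_pageword_line(pageword_line):
        lines = i_files[f"i{length}.idx"]
        kept = [p for p in lines[wid].split(":")
                if p and p.split("*")[0] != str(pid)]
        lines[wid] = ":".join(kept)


# Hypothetical data: page 55 contains words 0 and 1 of byte length 5.
i_files = {"i5.idx": ["0*2:55*5:23*1", "55*1"]}
remove_page(55, "5*0:5*1", i_files)
print(i_files)
```

Without pageword.idx, finding every occurrence of a page would mean scanning all i<word length>.idx files.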

Indexing

inc/indexer.php contains all functions related to building the fulltext index.

FIXME describe the process

Searching

inc/fulltext.php contains all functions related to searching the fulltext index.

FIXME describe the process

devel/fulltextindex.txt · Last modified: 2018-11-28 20:47 by torpedo

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International