plugin:nbsp

Non-breaking Space Syntax PlugIn

Compatible with DokuWiki: 2005-07-13+
Description: Use non-breaking spaces.
Last updated on: 2007-08-15
Provides: Syntax

This extension has not been updated in over 2 years. It may no longer be maintained or supported and may have compatibility issues.

Tagged with typography

A so-called non-breaking space1) is a character which is rendered visually just like the usual spaces2). The whole point of using this unusual character (instead of an ordinary space) is that it is not considered a word delimiter. In other words: it's supposed to be handled like a normal character which just happens to have no visible points.3)

While the NBSP character is quite often abused for nothing more than design purposes4), there are a few occasions for its legitimate use. Considering today's keyboards and writing habits, however, it's not that easy to actually type the NBSP character where it's needed. This is especially true when writing text that is to be stored in UTF-8 format5), because there two bytes are used to represent the NBSP: byte #194 immediately followed by byte #160.
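The two byte values named above can be checked directly in PHP. The following is a minimal sketch (all variable names are illustrative only) showing that byte #194 followed by byte #160 is the UTF-8 form of the NBSP, i.e. code point U+00A0:

```php
<?php
// Build the NBSP from the two decimal byte values named in the text.
$nbspFromBytes = chr(194) . chr(160);

// The same sequence written with hexadecimal escapes.
$nbspFromHex = "\xC2\xA0";

// Decoding the HTML entity as UTF-8 yields the same two bytes.
$nbspFromEntity = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

// strlen() counts bytes, so the single character occupies two of them.
$byteCount = strlen($nbspFromBytes);

// Hex dump of the sequence for inspection.
$hexDump = bin2hex($nbspFromBytes);
```

Comparing the three strings with `===` is a quick way to confirm that all of them describe the same character.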

While modern text-processing software6) often supports UTF-8 characters, DokuWiki pages are usually written and edited in a simple web browser with (X)HTML input forms, where it can be quite difficult to insert such a character7). This plugin tries to solve the problem.

Note from the comments: You can get most of the effect of this by simply creating a conf/entities.local.conf file and adding a line like this to it:

(nbsp) &nbsp;

In the same way you can add a (tab) entry for 3-4 spaces, e.g. to indent a paragraph.

Usage

The markup syntax implemented by this plugin is quite simple and looks like either

  "\ "

(i.e. a backslash8) character followed by a space)9) or

~~SP~~

(for those of you who'd rather have a more expressive markup). Whenever one of these character sequences10) is found, it will be replaced by the appropriate UTF-8 characters. That's all.

Personally I'd recommend using the first variant (i.e. "\ ") as it seems to be more intuitive and needs fewer characters to type and store. Both ways of markup, however, are replaced by exactly the same UTF-8 character sequence.
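The effect of both markup variants can be sketched outside DokuWiki with plain string functions. This is only an illustration: the function name `nbsp_expand()` is hypothetical, and the real plugin does its matching through DokuWiki's lexer, not through `preg_replace()`. The lookbehind is the one used in the plugin source, so that DokuWiki's linebreak markup (two backslashes) is left alone:

```php
<?php
// Hypothetical helper, NOT part of the plugin: expands both markup
// variants to the raw UTF-8 NBSP sequence.
function nbsp_expand(string $text): string {
    $nbsp = "\xC2\xA0";

    // 'verbose' variant
    $text = str_replace('~~SP~~', $nbsp, $text);

    // backslash + space, unless preceded by another backslash
    // (that would be DokuWiki's linebreak markup)
    return preg_replace('/(?<!\x5C)\x5C\x20/', $nbsp, $text) ?? $text;
}
```

For example, `nbsp_expand('10\ km')` and `nbsp_expand('10~~SP~~km')` both return the string "10", the two NBSP bytes, and "km".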

Installation

It's quite easy to integrate this plugin with your DokuWiki:

  1. Download the source archive (~3KB) and unpack it in your DokuWiki plugin directory {dokuwiki}/lib/plugins (make sure that included subdirectories are unpacked correctly); this will create the directory {dokuwiki}/lib/plugins/nbsp.
  2. Make sure both the new directory and the files therein are readable by the web server, e.g. (assuming the server runs as user and group "apache"):
    	chown -Rc apache:apache dokuwiki/lib/plugins/*

You might as well use the plugin manager for installing or updating this plugin.

Plugin Source

Here comes the GPLed PHP source11) for those who'd like to scan it before actually installing it:

<?php
if (! class_exists('syntax_plugin_nbsp')) {
  if (! defined('DOKU_PLUGIN')) {
    if (! defined('DOKU_INC')) {
      define('DOKU_INC', realpath(dirname(__FILE__) . '/../../') . '/');
    } // if
    define('DOKU_PLUGIN', DOKU_INC . 'lib/plugins/');
  } // if
  // Include parent class:
  require_once(DOKU_PLUGIN . 'syntax.php');
 
/**
 * <tt>syntax_plugin_nbsp.php </tt>- A PHP4 class that provides the
 * ability to insert non-breaking spaces in <tt>DokuWiki</tt> pages.
 *
 * <p>
 * To actually use this plugin just add <tt>\\ </tt> (i.e. backslash
 * space) or <tt>~~SP~~</tt> in a DokuWiki page. This will be expanded
 * to the UTF-8 character sequence.
 * </p><pre>
 *  Copyright (C) 2005, 2007 DFG/M.Watermann, D-10247 Berlin, FRG
 *      All rights reserved
 *    EMail : &lt;support@mwat.de&gt;
 * </pre>
 * <div class="disclaimer">
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either
 * <a href="http://www.gnu.org/licenses/gpl.html">version 3</a> of the
 * License, or (at your option) any later version.<br>
 * This software is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU General Public License for more details.
 * </div>
 * @author <a href="mailto:support@mwat.de">Matthias Watermann</a>
 * @version <tt>$Id: syntax_plugin_nbsp.php,v 1.7 2007/08/15 12:36:19 matthias Exp $</tt>
 * @since created 27-Aug-2005
 */
class syntax_plugin_nbsp extends DokuWiki_Syntax_Plugin {
 
  /**
   * Tell the parser whether the plugin accepts syntax mode
   * <tt>$aMode</tt> within its own markup.
   *
   * <p>
   * This method always returns <tt>FALSE</tt> since no other data
   * can be nested inside a non-breaking space.
   * </p>
   * @param $aMode String The requested syntaxmode.
   * @return Boolean <tt>FALSE</tt> always.
   * @public
   * @see getAllowedTypes()
   */
  function accepts($aMode) {
    return FALSE;
  } // accepts()
 
  /**
   * Connect lookup patterns to lexer.
   *
   * @param $aMode String The desired rendermode.
   * @public
   * @see render()
   */
  function connectTo($aMode) {
    // 'verbose' pattern:
    $this->Lexer->addSpecialPattern('~~SP~~', $aMode, 'plugin_nbsp');
    // Don't match DokuWiki's linebreak markup:
    $this->Lexer->addSpecialPattern('(?<!\x5C)\x5C\x20',
      $aMode, 'plugin_nbsp');
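    // in case a raw #160 was inserted (e.g. by copy&paste);
    // NOTE: the lookbehind only excludes preceding bytes up to 0xE2, so
    // this also matches a 0xA0 continuation byte that directly follows a
    // lead byte above 0xE2 (e.g. U+B800, bytes EB A0 80); see Discussion:
    $this->Lexer->addSpecialPattern('(?<![\x80-\xE2])\xA0',
      $aMode, 'plugin_nbsp');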
  } // connectTo()
 
  /**
   * Get an associative array with plugin info.
   *
   * <p>
   * The returned array holds the following fields:
   * <dl>
   * <dt>author</dt><dd>Author of the plugin</dd>
   * <dt>email</dt><dd>Email address to contact the author</dd>
   * <dt>date</dt><dd>Last modified date of the plugin in
   * <tt>YYYY-MM-DD</tt> format</dd>
   * <dt>name</dt><dd>Name of the plugin</dd>
   * <dt>desc</dt><dd>Short description of the plugin (Text only)</dd>
   * <dt>url</dt><dd>Website with more information on the plugin
   * (eg. syntax description)</dd>
   * </dl>
   * @return Array Information about this plugin class.
   * @public
   * @static
   */
  function getInfo() {
    return array (
      'author' =>  'Matthias Watermann',
      'email' =>  'support@mwat.de',
      'date' =>  '2007-08-15',
      'name' =>  'NBSP Plugin',
      'desc' =>  'Include non-breaking spaces in wiki pages.',
      'url' =>  'http://www.dokuwiki.org/plugin:nbsp');
  } // getInfo()
 
  /**
   * Where to sort in?
   *
   * @return Integer <tt>176</tt>.
   * @public
   * @static
   */
  function getSort() {
    return 176;
  } // getSort()
 
  /**
   * Get the type of syntax this plugin defines.
   *
   * @return String <tt>'substition'</tt> (a misspelled 'substitution').
   * @public
   * @static
   */
  function getType() {
    return 'substition';
  } // getType()
 
  /**
   * Handler to prepare matched data for the rendering process.
   *
   * @param $aMatch String The text matched by the patterns.
   * @param $aState Integer The lexer state for the match.
   * @param $aPos Integer The character position of the matched text.
   * @param $aHandler Object Reference to the Doku_Handler object.
   * @return Integer The given <tt>$aState</tt> value.
   * @public
   * @see render()
   * @static
   */
  function handle($aMatch, $aState, $aPos, &$aHandler) {
    return $aState;  // nothing more to do here ...
  } // handle()
 
  /**
   * Handle the actual output creation.
   *
   * <p>
   * The method checks for the given <tt>$aMode</tt> and returns
   * <tt>FALSE</tt> when a mode isn't supported. <tt>$aRenderer</tt>
   * contains a reference to the renderer object which is currently
   * handling the rendering. The contents of <tt>$aData</tt> is the
   * return value of the <tt>handle()</tt> method.
   * </p>
   * @param $aFormat String The output format to generate.
   * @param $aRenderer Object A reference to the renderer object.
   * @param $aData Integer The state value returned by <tt>handle()</tt>.
   * @return Boolean <tt>TRUE</tt> always.
   * @public
   * @see handle()
   */
  function render($aFormat, &$aRenderer, $aData) {
    if (DOKU_LEXER_SPECIAL == $aData) {
      // No test of '$aFormat' needed here:
      // The raw UTF-8 character sequence is the same anyway.
      $aRenderer->doc .= chr(194) . chr(160);
    } // if
    return TRUE;
  } // render()
 
} // class syntax_plugin_nbsp
} // if
//Setup VIM: ex: et ts=2 enc=utf-8 :
?>

Changes

2007-08-15:
* added GPL link and fixed some doc problems;

2007-01-16:
* replaced UTF8_ENTITY_NBSP const by raw UTF-8 characters in 'render()';

2007-01-06:
* minor internal changes to write out raw UTF-8 character sequence

2005-09-26:
# fixed problem with UTF-8 sequences with chr(160)

2005-08-29:
- removed unneeded method 'getAllowedTypes()';

2005-08-27:
+ initial release;

Matthias Watermann 2007-08-15

See also

Plugins by the same author

UTF-8 sequences with chr(160)

The following table is an incomplete extract of characters whose UTF-8 sequences contain byte #160 (0xA0), gathered from:

ISO/IEC 10646-1:2000 aka Unicode v3.0.1 by Unicode Consortium

Please note that your browser might not be able to correctly show/render all characters13) in this table.

Hex Dec Chr ISO/IEC 10646-1:2000(E) Character Name
00A0 160 NO-BREAK SPACE
00E0 224 à LATIN SMALL LETTER A WITH GRAVE
0120 288 Ġ LATIN CAPITAL LETTER G WITH DOT ABOVE
0160 352 Š LATIN CAPITAL LETTER S WITH CARON
01A0 416 Ơ LATIN CAPITAL LETTER O WITH HORN
01E0 480 Ǡ LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
0260 608 ɠ LATIN SMALL LETTER G WITH HOOK
02A0 672 ʠ LATIN SMALL LETTER Q WITH HOOK
02E0 736 ˠ MODIFIER LETTER SMALL GAMMA
0320 800 ̠ COMBINING MINUS SIGN BELOW
0360 864 ͠ COMBINING DOUBLE TILDE
03A0 928 Π GREEK CAPITAL LETTER PI
03E0 992 Ϡ GREEK LETTER SAMPI
0420 1056 Р CYRILLIC CAPITAL LETTER ER
0460 1120 Ѡ CYRILLIC CAPITAL LETTER OMEGA
04A0 1184 Ҡ CYRILLIC CAPITAL LETTER BASHKIR KA
04E0 1248 Ӡ CYRILLIC CAPITAL LETTER ABKHASIAN DZE
05A0 1440 ֠ HEBREW ACCENT TELISHA GEDOLA
05E0 1504 נ HEBREW LETTER NUN
0660 1632 ٠ ARABIC-INDIC DIGIT ZERO
06A0 1696 ڠ ARABIC LETTER AIN WITH THREE DOTS ABOVE
06E0 1760 ۠ ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
07A0 1952 ޠ THAANA LETTER TO
0920 2336 DEVANAGARI LETTER TTHA
0960 2400 DEVANAGARI LETTER VOCALIC RR
0E20 3616 THAI CHARACTER PHO SAMPHAO
10A0 4256 GEORGIAN CAPITAL LETTER AN
10E0 4320 GEORGIAN LETTER RAE
1120 4384 HANGUL CHOSEONG PIEUP-TIKEUT
1160 4448 HANGUL JUNGSEONG FILLER
11A0 4512 HANGUL JUNGSEONG ARAEA-U
11E0 4576 HANGUL JONGSEONG MIEUM-CHIEUCH
1220 4640 ETHIOPIC SYLLABLE SZA
1260 4704 ETHIOPIC SYLLABLE BA
12A0 4768 ETHIOPIC SYLLABLE GLOTTAL A
12E0 4832 ETHIOPIC SYLLABLE ZHA
1320 4896 ETHIOPIC SYLLABLE THA
13A0 5024 CHEROKEE LETTER A
13E0 5088 CHEROKEE LETTER TLO
1420 5152 CANADIAN SYLLABICS FINAL GRAVE
1460 5216 CANADIAN SYLLABICS WEST-CREE TWOO
14A0 5280 CANADIAN SYLLABICS NASKAPI CWAA
14E0 5344 CANADIAN SYLLABICS LWII
1520 5408 CANADIAN SYLLABICS WEST-CREE SHWOO
1560 5472 CANADIAN SYLLABICS THI
15A0 5536 CANADIAN SYLLABICS LHI
15E0 5600 CANADIAN SYLLABICS CARRIER THI
1620 5664 CANADIAN SYLLABICS CARRIER JJI
1660 5728 CANADIAN SYLLABICS CARRIER TSA
16A0 5792 RUNIC LETTER FEHU FEOH FE F
16E0 5856 RUNIC LETTER EAR
1E20 7712 LATIN CAPITAL LETTER G WITH MACRON
1E60 7776 LATIN CAPITAL LETTER S WITH DOT ABOVE
1EA0 7840 LATIN CAPITAL LETTER A WITH DOT BELOW
1EE0 7904 LATIN CAPITAL LETTER O WITH HORN AND TILDE
1F20 7968 GREEK SMALL LETTER ETA WITH PSILI
1F60 8032 GREEK SMALL LETTER OMEGA WITH PSILI
1FA0 8096 GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
1FE0 8160 GREEK SMALL LETTER UPSILON WITH VRACHY
2020 8224 DAGGER
20A0 8352 EURO-CURRENCY SIGN
20E0 8416 COMBINING ENCLOSING CIRCLE BACKSLASH
2120 8480 SERVICE MARK
2160 8544 ROMAN NUMERAL ONE
21A0 8608 RIGHTWARDS TWO HEADED ARROW
21E0 8672 LEFTWARDS DASHED ARROW
2220 8736 ANGLE
2260 8800 NOT EQUAL TO
22A0 8864 SQUARED TIMES
22E0 8928 DOES NOT PRECEDE OR EQUAL
2320 8992 TOP HALF INTEGRAL
2360 9056 APL FUNCTIONAL SYMBOL QUAD COLON
2420 9248 SYMBOL FOR SPACE
2460 9312 CIRCLED DIGIT ONE
24A0 9376 PARENTHESIZED LATIN SMALL LETTER E
24E0 9440 CIRCLED LATIN SMALL LETTER Q
2520 9504 BOX DRAWINGS VERTICAL HEAVY AND RIGHT LIGHT
2560 9568 BOX DRAWINGS DOUBLE VERTICAL AND RIGHT
25A0 9632 BLACK SQUARE
25E0 9696 UPPER HALF CIRCLE
2620 9760 SKULL AND CROSSBONES
2660 9824 BLACK SPADE SUIT
2720 10016 MALTESE CROSS
27A0 10144 HEAVY DASHED TRIANGLE-HEADED RIGHTWARDS ARROW
2800 10240 BRAILLE PATTERN BLANK
2801 10241 BRAILLE PATTERN DOTS-1
2802 10242 BRAILLE PATTERN DOTS-2
2803 10243 BRAILLE PATTERN DOTS-12
2804 10244 BRAILLE PATTERN DOTS-3
2805 10245 BRAILLE PATTERN DOTS-13
2806 10246 BRAILLE PATTERN DOTS-23
2807 10247 BRAILLE PATTERN DOTS-123
2808 10248 BRAILLE PATTERN DOTS-4
2809 10249 BRAILLE PATTERN DOTS-14
280A 10250 BRAILLE PATTERN DOTS-24
280B 10251 BRAILLE PATTERN DOTS-124
280C 10252 BRAILLE PATTERN DOTS-34
280D 10253 BRAILLE PATTERN DOTS-134
280E 10254 BRAILLE PATTERN DOTS-234
280F 10255 BRAILLE PATTERN DOTS-1234
2810 10256 BRAILLE PATTERN DOTS-5
2811 10257 BRAILLE PATTERN DOTS-15
2812 10258 BRAILLE PATTERN DOTS-25
2813 10259 BRAILLE PATTERN DOTS-125
2814 10260 BRAILLE PATTERN DOTS-35
2815 10261 BRAILLE PATTERN DOTS-135
2816 10262 BRAILLE PATTERN DOTS-235
2817 10263 BRAILLE PATTERN DOTS-1235
2818 10264 BRAILLE PATTERN DOTS-45
2819 10265 BRAILLE PATTERN DOTS-145
281A 10266 BRAILLE PATTERN DOTS-245
281B 10267 BRAILLE PATTERN DOTS-1245
281C 10268 BRAILLE PATTERN DOTS-345
281D 10269 BRAILLE PATTERN DOTS-1345
281E 10270 BRAILLE PATTERN DOTS-2345
281F 10271 BRAILLE PATTERN DOTS-12345
2820 10272 BRAILLE PATTERN DOTS-6
2821 10273 BRAILLE PATTERN DOTS-16
2822 10274 BRAILLE PATTERN DOTS-26
2823 10275 BRAILLE PATTERN DOTS-126
2824 10276 BRAILLE PATTERN DOTS-36
2825 10277 BRAILLE PATTERN DOTS-136
2826 10278 BRAILLE PATTERN DOTS-236
2827 10279 BRAILLE PATTERN DOTS-1236
2828 10280 BRAILLE PATTERN DOTS-46
2829 10281 BRAILLE PATTERN DOTS-146
282A 10282 BRAILLE PATTERN DOTS-246
282B 10283 BRAILLE PATTERN DOTS-1246
282C 10284 BRAILLE PATTERN DOTS-346
282D 10285 BRAILLE PATTERN DOTS-1346
282E 10286 BRAILLE PATTERN DOTS-2346
282F 10287 BRAILLE PATTERN DOTS-12346
2830 10288 BRAILLE PATTERN DOTS-56
2831 10289 BRAILLE PATTERN DOTS-156
2832 10290 BRAILLE PATTERN DOTS-256
2833 10291 BRAILLE PATTERN DOTS-1256
2834 10292 BRAILLE PATTERN DOTS-356
2835 10293 BRAILLE PATTERN DOTS-1356
2836 10294 BRAILLE PATTERN DOTS-2356
2837 10295 BRAILLE PATTERN DOTS-12356
2838 10296 BRAILLE PATTERN DOTS-456
2839 10297 BRAILLE PATTERN DOTS-1456
283A 10298 BRAILLE PATTERN DOTS-2456
283B 10299 BRAILLE PATTERN DOTS-12456
283C 10300 BRAILLE PATTERN DOTS-3456
283D 10301 BRAILLE PATTERN DOTS-13456
283E 10302 BRAILLE PATTERN DOTS-23456
283F 10303 BRAILLE PATTERN DOTS-123456
2860 10336 BRAILLE PATTERN DOTS-67
28A0 10400 BRAILLE PATTERN DOTS-68
28E0 10464 BRAILLE PATTERN DOTS-678


For more information please refer to the Unicode Consortium.
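Entries like those in the table can be recomputed by encoding each code point as UTF-8 and looking for byte 0xA0. The following is a minimal sketch; both function names are hypothetical, and the encoder only covers U+0000 to U+FFFF without special handling of the surrogate range:

```php
<?php
// Hypothetical helper: encode one code point (U+0000..U+FFFF) as UTF-8.
function utf8_bytes(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);                       // 1 byte (ASCII)
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))          // 2 bytes
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))             // 3 bytes
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: list all code points in [$from, $to] whose
// UTF-8 encoding contains the byte 0xA0.
function a0_code_points(int $from, int $to): array {
    $hits = array();
    for ($cp = $from; $cp <= $to; $cp++) {
        if (strpos(utf8_bytes($cp), "\xA0") !== false) {
            $hits[] = $cp;
        }
    }
    return $hits;
}
```

Calling `a0_code_points(0x00A0, 0x01FF)` yields the six code points of the first six table rows; for the Braille block every code point from U+2800 to U+283F qualifies because the second byte of its encoding is 0xA0.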

Discussion

Hints, comments, suggestions …

  • Before writing this little plugin I tried to use the file conf/custom.conf for this task. But this didn't work because for one reason or the other that file isn't used at all by DokuWiki (2005-07-13). Grepping the sources showed that the file isn't referenced anywhere. I dunno whether this is a bug or intended behaviour.

You can get most of the effect of this by simply creating a conf/entities.local.conf file and adding a line like this to it. – Adam Shand 2007-01-29

(nbsp) &nbsp;

I am new to DokuWiki (7 days). I installed your plugin in a locally installed DokuWiki and in a test one available on the net. Since then, wherever the letter “à” (&agrave;) occurs, it is replaced by a space, or by a question mark followed by a space, depending on the browser used. All my files are in UTF-8. The funny thing is that you are detecting the sequence “C2A0” while this letter is “C3A0” in UTF-8. You can have a look at the page http://carnetweb.info/wiki/doku.php?id=developement where, at the end of the line starting with “nbsp”, there is the sequence “aàa”.

Thanks for spotting this! Please update your installation by using the fixed code above.
Matthias Watermann 2005-09-27 14:17

PS: sorry to report here, but I failed to subscribe to the mailing list. Each time the confirmation message creates a “failure notice” as follows:

<ecartis@freelists.org>:
206.53.239.180 does not like recipient.
Remote host said: 554 <26.mail-out.ovh.net[213.186.42.179]>: Client host rejected: Access denied
Giving up on 206.53.239.180.

– francois DOT granger AT gmail DOT com

replace the third pattern in function connectTo with '\xC2\xA0'. There does seem to be a problem with PHP's preg and these characters, however to my mind the pattern should be as I described as it doesn't make sense to leave the “\xC2” hanging on its own when there is a pattern match. — Christopher Smith 2005-09-26 02:02
In this case, I fear, it's not PHP's fault. My intention (as stated in the source comment) was to replace chr(160) occurrences which were not properly UTF-8 encoded. But, alas, it didn't occur to me at the time that chr(160) can be part of another UTF-8 sequence. So while your suggestion would (X)HTML encode the correct nbsp UTF-8 sequence (which is a good thing in itself), it does not do what I intended in the first place; in fact it does the opposite: while I wanted to say “find all #160 that are not UTF-8 encoded”, your proposal says “find #160 that is UTF-8 encoded”. As for the chr(194) (0xC2): it was not supposed to be “hanging on its own”, as the RegEx means that chr(194) must not be in front of chr(160) and, as Francois reported, the RegEx as such does exactly that, with the unintended side effects he mentioned :-(
Matthias Watermann 2005-09-26 07:44
It was late :? (and I just realised there is no oops smiley in Dokuwiki) I went by the comments without paying enough attention to the regex. I guess you need a more complex pattern that attempts to match invalid UTF-8 byte sequences or do away with that pattern altogether. I guess from a conceptual point of view a plugin shouldn't really have to validate the byte sequence used in raw wiki page - especially when it is outside the syntax proposed by the plugin. — Christopher Smith 2005-09-26 09:40
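One way to approach the "more complex pattern" mentioned above is to let the regular expression consume well-formed UTF-8 sequences as a whole, so that only a 0xA0 byte outside any such sequence is left over to be replaced. The following is merely a sketch under stated assumptions: the function name `replace_stray_a0()` is hypothetical, the byte ranges are simplified (overlong forms and surrogates are not rejected), and it works on a plain string, not inside DokuWiki's lexer:

```php
<?php
// Hypothetical helper, NOT part of the plugin: replace only those 0xA0
// bytes that are not part of a (roughly) well-formed UTF-8 sequence.
function replace_stray_a0(string $text): string {
    $pattern = '/[\x00-\x7F]'                  // ASCII
             . '|[\xC2-\xDF][\x80-\xBF]'       // 2-byte sequence
             . '|[\xE0-\xEF][\x80-\xBF]{2}'    // 3-byte sequence
             . '|[\xF0-\xF4][\x80-\xBF]{3}'    // 4-byte sequence
             . '|(\xA0)/';                     // stray 0xA0 (group 1)

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is only set when the stray-byte branch matched.
            return (isset($m[1]) && $m[1] !== '') ? "\xC2\xA0" : $m[0];
        },
        $text
    );
    return $result ?? $text;
}
```

With this approach "a\xA0b" becomes "a\xC2\xA0b", while "\xC3\xA0" (à) and "\xEB\xA0\x80" (U+B800) pass through unchanged because their 0xA0 byte is consumed together with its lead byte.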

Had UTF-8 problems with this one. Deleting

$this->Lexer->addSpecialPattern('(?<![\x80-\xE2])\xA0', $aMode, 'plugin_nbsp');

solved it. Hope it helps others.


I'm a Korean user and I'm using the 2007-08-15 version. When I enable this plugin, rows containing Unicode letters from U+B800 to U+B83F and from U+C800 to U+C83F are NOT displayed. Please refer to DokuWiki issue #988. Thank you in advance :) 2016-04-22
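The two reports above are consistent with how the third pattern in connectTo() behaves at byte level: U+B800 to U+B83F encode as EB A0 80 to EB A0 BF and U+C800 to U+C83F as EC A0 80 to EC A0 BF, so their 0xA0 byte follows a lead byte that the lookbehind does not exclude. The following sketch runs that pattern through preg_match() on a few sample strings. The sample labels and variable names are illustrative, and testing outside DokuWiki's lexer is an assumption about equivalent behaviour, not a trace of the plugin itself:

```php
<?php
// Third pattern from connectTo(), wrapped in delimiters for preg_match().
$pattern = '/(?<![\x80-\xE2])\xA0/';

$samples = array(
    'stray byte'  => "a\xA0b",        // raw #160, the intended target
    'U+00A0 nbsp' => "\xC2\xA0",      // proper UTF-8 NBSP
    'U+00E0'      => "\xC3\xA0",      // LATIN SMALL LETTER A WITH GRAVE
    'U+B800'      => "\xEB\xA0\x80",  // Hangul syllable, lead byte 0xEB
    'U+C800'      => "\xEC\xA0\x80",  // Hangul syllable, lead byte 0xEC
);

// Record for each sample whether the pattern matches somewhere in it.
$hits = array();
foreach ($samples as $label => $bytes) {
    $hits[$label] = (preg_match($pattern, $bytes) === 1);
}
```

In this sketch the pattern matches the stray byte as intended and skips U+00A0 and U+00E0, but it also matches inside the two Hangul samples, which would split those characters when the matched byte is replaced.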

1) NBSP: ISO-8859-1 char #160 and UTF-8 char-sequence #194#160
2) space: ASCII char #32; not to be confused with the blank at char #255 of some 8-bit “extended ASCII” code pages (#255 lies outside 7-bit ASCII)
3) In consequence it will not be used to break lines (word wrap) and it won't get expanded if the renderer is going to justify text by moving words in a line of text.
4) which are usually dealt with better by use of CSS
5) which is the default text format used for DokuWiki's pages
6) e.g. OpenOffice.org and most GNU/Linux editors
7) see the BOMfix plugin for another way to edit
8) backslash: ASCII char #92
9) One could call this an “escaped space”.
10) you may use whichever variant you like and may even mix them in the same page
11) The comments within the source file are suitable for the OSS doxygen tool, a documentation system for C++, C, Java, Objective-C, Python, IDL and to some extent PHP, C#, and D. Since I'm working with different programming languages, it's a great help to have one tool that handles the docs for all of them.
12) obsoleted by incorporating its ability into the Code plugin
13) Whether non-ASCII characters can be viewed on screen (or printed) depends on the fonts actually installed with your OS and GUI. If there are no UTF-8 enabled fonts (or only incomplete ones), the browser will fall back to a browser-dependent default character.
plugin/nbsp.txt · Last modified: 2016-04-22 04:48 by 210.205.62.177