HTMLDOC

HTMLDOC is a free, high-quality HTML to PDF converter. Its only drawback is that it doesn't support CSS in its current version (you can still achieve good results!). The big advantage is that you don't need to install anything else (no Ghostscript, for example).
To use HTMLDOC in your wiki, do the following:

  • Install htmldoc (pretty easy)
  • Add the Export to PDF button as described in common_changes.
  • Create a temporary directory that the webserver can write to for the intermediate step (a shell sketch follows the code below).
  • In the function act_export, in inc/actions.php, add this (just after the “global” lines):
      if($act == 'export_pdf'){
        pdfmake(  p_wiki_xhtml($ID,$REV,false)  );
        exit;
      }
  • Modify inc/common.php: Append:
    function pdfmake($text){
      $dir="/home/wikiroot/files/pdfexporttmp/"; 
      $filenameInput=$dir."input.html";
      $filenameOutput=$dir."output.pdf";
     
    #clean up the string (so we don't have any artifacts)
      $text = umlaute($text); #this is for German umlauts
      $text = preg_replace("'<div class=\"toc\">\n<div class=\"tocheader\">.*?</div>\n</div>'si",'',$text );
      $text = preg_replace("'<a[^>]*?></a>'si", '', $text );
     
      $text = str_replace('&ldquo;','&quot;',$text );
      $text = str_replace('&rdquo;','&quot;',$text );
     
      $text = str_replace('<table', '<table border="1" ', $text);
     
    #write the string to temporary html-file
      $fp = fopen ($filenameInput, "w") or die ("can't create file");
      fwrite($fp,$text);
      fclose($fp);
     
    #convert using htmldoc
      $command = "/usr/bin/htmldoc --no-title -f " . $filenameOutput . " " . $filenameInput;
      system($command);
      system("exit(0)");
     
    #send to browser
      $filenameOutput=trim($filenameOutput);
      header("Content-type: application/pdf");
      header("Content-Disposition: attachment; filename=wikiexport_" . str_replace(':','_',$_GET["id"]) . ".pdf");
      header("Pragma: public"); // This makes the PDF-Export via https:// work with IE - it enables the IE to save the PDF file.
      $fd = @fopen($filenameOutput,"r");
      while(!feof($fd)){
        echo fread($fd,2048);
      }
      fclose($fd);
     
    #clean up temporary files
      system("rm " . $filenameInput);
      system("rm " . $filenameOutput);
    }
    function umlaute($text){
      return strtr($text,array(
          "ß"=>"&szlig;",
          "ä"=>"&auml;",
          "ü"=>"&uuml;",
          "ö"=>"&ouml;",
          "Ä"=>"&Auml;",
          "Ü"=>"&Uuml;",
          "Ö"=>"&Ouml;"));}

With IE6 you get a “file not found” error if you try to open the PDF generated by DokuWiki with the default PDF reader (although you can still save the file to your local disk). You can fix this by appending this line to the code above:

header('Cache-control: private, must-revalidate');

Since DokuWiki is based on UTF-8 and the PDFs do not display the umlauts correctly, the last function can be replaced with this one:

function umlaute($text){
  return strtr($text,array(
      utf8_encode("ß") => "&szlig;",
      utf8_encode("ä") => "&auml;",
      utf8_encode("ü") => "&uuml;",
      utf8_encode("ö") => "&ouml;",
      utf8_encode("Ä") => "&Auml;",
      utf8_encode("Ü") => "&Uuml;",
      utf8_encode("Ö") => "&Ouml;"));
}

HTML->PS->PDF

These are some of the steps and modifications that were previously at bobbaddeley.com.

  • Install the appropriate Ghostscript and PDF packages: html2ps and Ghostscript's ps2pdfwr are both called. 1)
  • Add the Export to PDF button as described above.
  • Create a temporary directory that the webserver can write to for the intermediate step. (I used pdftmp)
  • In the function act_export, in inc/actions.php, add this:
      }elseif($act == 'export_pdf'){
        $data = p_wiki_xhtml($ID,$REV,false);
        $data = preg_replace("'<div class=\"toc\"><div class=\"tocheader\">.*?</div></div>'si",'',$data);
        $data = preg_replace("'<a[^>]*?></a>'si", '', $data); #to prevent </a> after each header
        pdfmake($data);
        exit;
      }
  • Modify inc/common.php: Append:
    #PDF Exporting code added to dokuwiki from bobbaddeley.com 
    function pdfmake($text){
      $file="pdfout";
      $dir="pdftmp/";
      $fp = fopen ($dir.$file.".html", "w") or die ("can't create file");
      fwrite($fp,$text);
      fclose($fp);
      $filenameInput=$dir.$file.".html";
      $filenameOutput=$dir.$file.".pdf";
      $filenameTemp=$dir.$file.".ps";
      $filenameCapture=$dir.$file.".txt";
      $command1="/usr/bin/html2ps -o " . $filenameTemp . " " . $filenameInput;
      exec($command1);
      system("exit(0)");
      # MAKE SURE THAT YOU DO EXIT(0) OR UNIX REDIRECTION FAILS!! (at least for me)
      # now we use Ghostscript's ps2pdfwr script to get it into PDF format
      # assumes Ghostscript is installed
      $syscall2="/usr/bin/ps2pdfwr " . $filenameTemp . " " . $filenameOutput . " ";
      system($syscall2);
      system("exit(0)");
      # clean up temporary files
      system("rm " . $filenameInput);
      system("rm " . $filenameTemp);
      $filenameOutput=trim($filenameOutput);
      /**
      * output the pdf code, streaming it to the browser
      * the relevant headers are set so that hopefully the browser will recognise it
      */
      header("Content-type: application/pdf");
      header("Content-Disposition: inline; filename=".$filenameOutput);
      @readfile($filenameOutput);
    } 

For inline image support and for internal DokuWiki links to work in the PDF, modify the html2ps command:

$docbase="/var/www/html/";
$urlbase="http://my.dokuwiki.host/dokuwiki";
$command1="/usr/bin/html2ps -b " . $urlbase . " -r " . $docbase . " -o " . $filenameTemp . " " . $filenameInput;

The $docbase and $urlbase paths could probably be read from existing DokuWiki variables instead of being hard-coded…
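
For instance, a sketch (untested): DOKU_INC, the wiki's filesystem root, is already used elsewhere on this page, while DOKU_BASE (the URL path of the wiki) is an assumption about the DokuWiki version in use:

  $docbase = DOKU_INC;                                        # filesystem root of the wiki
  $urlbase = "http://" . $_SERVER['SERVER_NAME'] . DOKU_BASE; # web root of the wiki
  $command1 = "/usr/bin/html2ps -b " . $urlbase . " -r " . $docbase . " -o " . $filenameTemp . " " . $filenameInput;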

Discussion

Great work, thanks. But in order to get filenames based on the page name with HTMLDOC, I had to change the line inserted in lib/tpl/template/main.php:

<?php print html_btn('exportpdf',$ID,'',array('do' => 'export_pdf', 'id' => $ID)) ?>  <!-- inserted line -->

I also had to remove the space after wikiexport in inc/common.php:

header("Content-Disposition: attachment; filename=wikiexport" . str_replace(':','_',$_GET["id"]) . ".pdf");

To retrieve images from the wiki server, use the following relative-link rewrites (I hope they won't cause security issues; I had problems with PNG files, so I converted them into JPEG format):

$text = preg_replace("'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/data/media/\\3\">", $text); # for uploaded images
$text = preg_replace("'<img src=\"/(.*?)\"'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1\"", $text); # for built-in images, smileys for example

The generated code for the table of contents contains newlines, so:

$text = preg_replace("'<div class=\"toc\">.<div class=\"tocheader\">.*?</div>.</div>'si",'',$text );

For the umlaute function (French support and image-link support):

function umlaute($text){
  return strtr($text,array(
      "ß"=>"&szlig;",
      "ä"=>"&auml;",
      "ë"=>"&euml;",
      "ï"=>"&iuml;",
      "ü"=>"&uuml;",
      "ö"=>"&ouml;",
      "Ä"=>"&Auml;",
      "Ë"=>"&Euml;",
      "Ë"=>"&Iuml;",
      "Ü"=>"&Uuml;",
      "Ö"=>"&Ouml;",
      "â"=>"&acirc;",
      "ê"=>"&ecirc;",
      "î"=>"&icirc;",
      "ô"=>"&ocirc;",
      "û"=>"&ucirc;",
      "Â"=>"&Acirc;",
      "Ê"=>"&Ecirc;",
      "Î"=>"&Icirc;",
      "Ô"=>"&Ocirc;",
      "Û"=>"&Ucirc;",
      "à"=>"&agrave;",
      "è"=>"&egrave;",
      "ù"=>"&ugrave;",
      "é"=>"&eacute;",
      "À"=>"&Agrave;",
      "È"=>"&Egrave;",
      "Ù"=>"&Ugrave;",
      "É"=>"&Eacute;",
      "ç"=>"&ccedil;",
      "%3A"=>"/"
));
}

Virgile Gerecke 2005-08-19 15:00

Why doesn't the background come out right in the PDF? I mean for <code> sections.

After a bunch of experimentation with the various techniques noted above, I found that the best way to get PDF versions of DokuWiki pages is simply to print to PDF. On Linux and Mac that capability is included with the operating system, and Windows users can use PrimoPDF.

Hi,
with htmldoc-1.8.24 I had to modify inc/common.php:

 $command = "/usr/bin/htmldoc  --webpage --no-title -f " . $filenameOutput . " " . $filenameInput;

An HTMLDOC variant

Hi, my main problems were the following:

  • HTMLDOC can't handle the UTF-8 code-page (I hope this will change in the future)
  • Pictures were missing from the PDF document
  • etc.

So, I extended the first HTMLDOC conversion code with some useful features:

  • First you need to make the common changes
  • You need to change conf/dokuwiki.php → add the following:
    /* PDF Export settings  */
    $conf['customtoc']   = '';               #''=use default TOC, any other=always use this TOC e.g. 'Contents'
    $conf['uselangtoc']  = 1;                #Use language dependent TOC (1=yes,0=no) - it will be ignored when use customtoc
    $conf['embedfonts']  = 1;                #1=use embedded fonts; 0=don't use /smaller pdf, but platform dependent/
    $conf['jpgrate']     = 0;                #JPEG compression rate, 0=off (smaller value -> smaller size but worse quality)
    $conf['usecustomreplace'] = 1;           #Use custom replace table (replaces based on conf/replace.conf)
    $conf['pdfcp']       = 'windows-1250';   #Which code-page will be used for conversion to pdf
                                             #Possible values are:
                                             # 'windows-874','windows-1250','windows-1251','windows-1252',
                                             # 'windows-1253','windows-1254','windows-1255','windows-1256',
                                             # 'windows-1257','windows-1258','iso-8859-1','iso-8859-2','iso-8859-3',
                                             # 'iso-8859-4','iso-8859-5','iso-8859-6','iso-8859-7','iso-8859-8',
                                             # 'iso-8859-9','iso-8859-14','iso-8859-15'
    $conf['htmldocdir']  = '/usr/bin/';      #Folder of htmldoc binary
    $conf['pdfversion']  = 'pdf14';          #PDF version compatibility
    $conf['browserwidth']= 1280;             #Size of pictures is optimized for this size (your browser width) in your documents
  • You need to create the conf/replaces.conf file for your alternative changes. This is my custom conf:
    # Patterns configured here will be replaced by the following patterns.
    
    // Please create a new section for the code-page with @code-page tag
    
    # Replaces for Windows-1250 code-page
    #
    # If a tag isn't in the code-page, you can create a replacement
    # from multiple tags:
    
    @windows-1250
    &ldquo; &#132;                  # Left double quote
    &rdquo; &#148;                  # Right double quote
    &trade; &#153;                  # TradeMark
    &ndash; &#150;                  # Short dash --
    &mdash; &#151;                  # Long dash ---
    &larr;  &#139;&#150;            # Left single arrow <-
    &rarr;  &#150;&#155;            # Right single arrow ->
    &harr;  &#139;&#150;&#155;      # Horizontal single arrow <->
    &lArr;  &laquo;-                # Left double arrow <=
    &rArr;  -&raquo;                # Right double arrow =>
    &hArr;  &laquo;-&raquo;         # Horizontal double arrow <=>

:!: You need to use the same code-page string for 'pdfcp' and the @code-page tag. Comments can be written in two ways: with // or # delimiters.
You can use multiple @code-page sections in this file, as in the example below.
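
For illustration (only a sketch; the iso-8859-2 replacements here are made-up examples, not tested mappings), a file covering two code-pages could look like this:

  @windows-1250
  &ldquo; &#132;                  # Left double quote
  &rdquo; &#148;                  # Right double quote

  @iso-8859-2
  &ldquo; "                       # fall back to a plain quote
  &rdquo; "                       # fall back to a plain quote

Only the section whose @code-page tag matches $conf['pdfcp'] is applied.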

  • After these steps you need to append the following PHP code (a modification of the version above) to inc/common.php:
    function pdfmake($text){
      global $lang;
      global $conf;
     
      $dir=DOKU_INC."tmp/";
      $filenameInput=$dir."input.html";
      $filenameOutput=$dir."output.pdf";
     
    # Convert text and toctitle to destination code-page
      $text=iconv("utf-8",$conf['pdfcp'],$text);
    # Change toctitle if needed
      if ($conf['customtoc']) {
        $toctitle=$conf['customtoc'];
        }
      elseif ($conf['uselangtoc']) {
        $toctitle=$lang['toc'];
        }
      else {
        $toctitle="Table of contents";
      }
      $toctitle=iconv("utf-8",$conf['pdfcp'],$toctitle);
     
    # htmldoc compatible name-conversion
      $pdfcp=preg_replace("/windows/i","cp",$conf['pdfcp']);
      $text = preg_replace("'<div class=\"toc\"><div class=\"tocheader\">.*?</div></div>'si",'',$text );
      $text = preg_replace("'<a[^>]*?></a>'si", '', $text );
     
    # Execute changes based on replaces.conf
      $replacesf=DOKU_INC . "conf/replaces.conf";
      if ($conf['usecustomreplace'] && file_exists($replacesf)) {
        $allreplaces=file_get_contents($replacesf);
     
    # Delete comments from file
        $allreplaces=preg_replace("'(//.*|\s+#.*|^#.*)'",'',$allreplaces);
     
    # Collapse multiple white-spaces into one
        $allreplaces=preg_replace("'(\t+| +)'",' ',$allreplaces);
     
    # Delete unwanted spaces
        $allreplaces=preg_replace("'(^ +| +$)'",'',$allreplaces);
     
    # Delete multiple empty lines
        $allreplaces=preg_replace("'\n+'","\n",$allreplaces);
     
    # Split codepage sections
        $codepages=preg_split("'\n@'",$allreplaces,-1, PREG_SPLIT_NO_EMPTY);
        $cpreg=preg_quote($conf['pdfcp']);
     
    # Find the used codepage
        foreach ($codepages as $codepage) {
          if (preg_match("'" . $cpreg . "'si",$codepage)) {
            $replaces=preg_replace("'" . $cpreg . "\n'si",'',$codepage);
            break;
          }
        }
     
    # Split patterns
        $patterns=preg_split("'\n'",$replaces,-1, PREG_SPLIT_NO_EMPTY);
        foreach ($patterns as $onepair) {
    # Split pairs
          $pairarray=preg_split("' '",$onepair);
    # Make changes
          $text=str_replace($pairarray[0],$pairarray[1],$text);
        }
      }
     
      $text = preg_replace("'src=\"http://.*media='i",'src="'.DOKU_INC.'data/media/',$text); # Change links to path of images
      $text = preg_replace("'href=\"http://.*media=.*class=\"media\"'i",'href="" class="media"',$text); # remove picture link address
      $textarr = preg_split("/\n/",$text);
     
    # Find and change linked images
      $linkeds = preg_grep("'<a href=.*<img src=.* /></a>'i",$textarr);
      foreach ( $linkeds as $linked ) {
        $picture = preg_replace("/<a href=.*\">/i",'',$linked);
        $picture = preg_replace("'</a>'i",'',$picture);
        $found = "'".preg_quote($linked)."'";
        $text = preg_replace($found,$picture,$text);
      }
    # HTML compatibility -> htmldoc can use <br> instead of <br/>
      $text = str_replace('/>','>',$text);
      $text = str_replace('<table', '<table border="1" ', $text);
     
    #write the string to temporary html-file
      $fp = fopen ($filenameInput, "w") or die ("can't create file");
      fwrite($fp,$text);
      fclose($fp);
     
    #Use embedded fonts if needed
      if ($conf['embedfonts']) {
        $fontparam='--embedfonts';
      } else {
        $fontparam='';
      }
     
    #JPEG compression rate settings
      $jpeg=" --jpeg=" . $conf['jpgrate'];
     
    #PDF compatibility
      $pdf="-t " . $conf['pdfversion'];
     
    #Documentwidth
      $width=" --browserwidth " . $conf['browserwidth'];
     
    #convert using htmldoc
      $command = $conf['htmldocdir'] . "htmldoc " . $pdf . $width . $jpeg . " --charset " . $pdfcp . " --no-title " . $fontparam . " --toctitle \"" . $toctitle . "\" -f " . $filenameOutput . " " . $filenameInput;
      system($command);
      system("exit(0)");
     
    #send to browser
      $filenameOutput=trim($filenameOutput);
      header("Content-type: application/pdf");
      header("Content-Disposition: attachment; filename=dokuwikiexport_" . str_replace(':','_',$_GET["id"]) . ".pdf");
      $fd = @fopen($filenameOutput,"r");
      while(!feof($fd)){
        echo fread($fd,2048);
      }
      fclose($fd);
     
    #clean up temporary files
      system("rm " . $filenameInput);
      system("rm " . $filenameOutput);
    }

So, with these modifications you can get better, nicer, more varied and more portable documents.
Cheers, — Peter Szládovics 2005-11-17 20:09


HTMLDOC Variant Modifications

I couldn't get PDF export to handle images and links properly until I replaced the following lines in inc/common.php (from Peter's code shown immediately above):

  $text = preg_replace("'src=\"http://.*media='i",'src="'.DOKU_INC.'data/media/',$text); # Change links to path of images
  $text = preg_replace("'href=\"http://.*media=.*class=\"media\"'i",'href="" class="media"',$text); # remove picture link address

with these lines:

  $text = preg_replace("'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/lib/exe/fetch.php?media=\\3\">", $text); # for uploaded images
  $text = preg_replace("'<img src=\"/(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1>", $text); # for built-in images, smileys for example
  $text = preg_replace("'href=\"/(.*?)>'si","href=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1\">", $text); # for internal links

I also changed the temporary filenames so that static names are not used, by changing (also in Peter's modifications to inc/common.php)

  $filenameInput=$dir."input.html";
  $filenameOutput=$dir."output.pdf";

to this:

  $filenameInput=tempnam($dir,"input_");
  $filenameOutput=tempnam($dir,"output_");

The PDF export is a great and very essential feature. Thanks! :-) — Brian Dundon 2006-3-18 22:44

Remark on HTMLDOC Variant Modifications

I used the code and I get a PDF file, but when I try to open it, I get “The file is damaged and could not be repaired”. The command line parameters are

htmldoc -t pdf12 --browserwidth 1280 --jpeg=0 --charset iso-8859-15 --no-title --embedfonts --toctitle "Inhaltsverzeichnis" -f /tmp/output_FSJlGa /tmp/input_GM1H6V

I tried with -t pdf14 and --charset windows-1252 too, with the same result. I use htmldoc-1.9.x-r1484. – Werner 2006-05-13 – P.S.: with HTMLDOC version 1.8.26 everything works fine. It may be an HTMLDOC issue, since even the htmldoc.pdf generated via make is unreadable.

Remark 2 on HTMLDOC Variant Modifications

The updated code (from Brian Dundon) above uses URLs to find the images. This variant uses full paths on the Unix/Linux system instead. URLs won't work when DokuWiki requires a login.

Note: the line for changing the path of built-in images still needs a closer look.

  $text = preg_replace("'src=\"http://.*media='i",'src="'.DOKU_INC.'data/media/',$text); # Change links to path of images
  $text = preg_replace("'href=\"http://.*media=.*class=\"media\"'mi",'href="" class="media"',$text); # remove picture link address
  $textarr = preg_split("/\n/",$text);
 
# Find and change linked images
  $linkeds = preg_grep("'<a href=.*<img src=.* /></a>'i",$textarr);
  foreach ( $linkeds as $linked ) {
    $picture = preg_replace("/<a href=.*\">/i",'',$linked);
    $picture = preg_replace("'</a>'i",'',$picture);
    $found = "'".preg_quote($linked)."'";
    $text = preg_replace($found,$picture,$text);
  }

with

  # find all user-uploaded images (and make sure this code doesn't get treated as an image!)
  preg_match_all ( '/media=([\./a-z].*?)"/mi', $text, $matches);
  # only use unique elements from array with matches following first parenthesis
  foreach (array_unique($matches[1]) as $match){
    # change namespace into directory syntax
    $newimg = preg_replace ('/:/', '/', $match );
    $text = preg_replace ( "/media=$match/m", "media=$newimg", $text );
  }
 
  $text = preg_replace('|<img src=".*?/lib/exe/fetch.php.*?media=(.*?)".*?>|mi',"<img src=\"" . DOKU_INC . "data/media/\\1\" />",$text); # for images
  $text = preg_replace('|<img src=".*?/(lib/images/.*?)".*?>|mi',"<img src=\"" . DOKU_INC . "\\1\" />",$text); # for internal images
  $text = preg_replace("|href=\"/(.*?)>|mi","href=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1\">", $text); # for internal links
#you can put this in or leave it out...  $text = preg_replace("|href=\".*?media=.*?class=\"media\"|i",'href="" class="media"',$text); # remove picture link address
  $textarr = preg_split("/\n/",$text);
 
# Find and change linked images
  $linkeds = preg_grep("|<a href=.*?<img src=.*? /></a>|i",$textarr);
  foreach ( $linkeds as $linked ) {
    $picture = preg_replace("|<a href=.*?\">|i",'',$linked);
    $picture = preg_replace("|</a>|i",'',$picture);
    $found = "'".preg_quote($linked)."'";
    $text = preg_replace($found,$picture,$text);
  }

Little remark:
Make sure conf/replace.conf is called conf/replaces.conf (with an 's').

Other little update:
swap

system("exit(0)");

with

system("exit 0");

The exit(0) generated an error in the Apache error.log file, because system() passes its argument to a shell and exit(0) is not valid shell syntax.

PDF export is very handy for DokuWiki. Thanks for the code; I hope this remark helps others as well. — F. Masselink 2006-10-24 11:33

Remark of Remark 2 on HTMLDOC Variant Modifications

I had a problem with preg_match_all, solved by changing the following lines: in inc/common.php replace

preg_match_all ( '/media=([\./a-z].*?)"/mi', $text, $matches);

with

preg_match_all ( '/media=([\.\/a-z].*?)"/mi', $text, $matches);

A.Chiapparini 2007-01-22 14:47

Modification for nice URLs

I'm using nice URLs with the rewrite method, and I had to add this line for internal pictures to work. It goes next to the other preg_replace lines.

$text = preg_replace('|<img src=".*?/_media/(.*?)\?.*?".*?>|mi',"<img src=\"" . DOKU_INC . "data/media/\\1\" />",$text);

I also had trouble with the filename of the downloaded PDF; here is my modification of the header line at the end of the pdfmake function. It sends a filename containing the wiki name and only the name of the wiki page, not the namespace location.

header("Content-Disposition: attachment; filename=".str_replace(' ','_',$conf['title']).'-'.end(split('/',$_GET["id"])).".pdf");

HTMLDOC recursive variant

My problem was that I needed support for child-page export, so I chose to modify/hack the An HTMLDOC variant code found on this page. Some of the remarks/improvements to that variant have also been included.

It will perform a recursive export of your current page: any internal links will be followed and converted to PDF too. The internal links are copied to the PDF, meaning they stay clickable just as they are in DokuWiki.

  • Follow the first 4 steps of HTMLDOC (on this page)
  • Then insert this into “inc/common.php”:
    function pdfmake($text)
    {
      //Variables used to stop the search for child pages
      global $pdfmake_recursion_level;
      global $pdfmake_recursion_current;
      global $pdfmake_links;
      $pdfmake_links = array();
     
      // This controls the depth in which it will search for subpages
      $pdfmake_recursion_level = 10;
      $pdfmake_recursion_current = 0;
     
      // Now search for children.
      $text = pdfmake_children($text);
     
      // And create the pdf
      pdfmake_inner($text);
    }
    function pdfmake_inner($text){
      global $lang;
      global $conf;
     
      $dir=DOKU_INC."tmp/";
      $filenameInput=$dir."input.html";
      $filenameOutput=$dir."output.pdf";
     
    # Convert text and toctitle to destination code-page
      $text=iconv("utf-8",$conf['pdfcp'],$text);
    # Change toctitle if needed
      if ($conf['customtoc']) {
        $toctitle=$conf['customtoc'];
        }
      elseif ($conf['uselangtoc']) {
        $toctitle=$lang['toc'];
        }
      else {
        $toctitle="Table of contents";
      }
      $toctitle=iconv("utf-8",$conf['pdfcp'],$toctitle);
     
    # htmldoc compatible name-conversion
      $pdfcp=preg_replace("/windows/i","cp",$conf['pdfcp']);
      $text = preg_replace("'<div class=\"toc\"><div class=\"tocheader\">.*?</div></div>'si",'',$text );
      $text = preg_replace("'<a[^>]*?></a>'si", '', $text );
     
    # Execute changes based on replaces.conf
      $replacesf=DOKU_INC . "conf/replaces.conf";
      if ($conf['usecustomreplace'] && file_exists($replacesf)) {
        $allreplaces=file_get_contents($replacesf);
     
    # Delete comments from file
        $allreplaces=preg_replace("'(//.*|\s+#.*|^#.*)'",'',$allreplaces);
     
    # Collapse multiple white-spaces into one
        $allreplaces=preg_replace("'(\t+| +)'",' ',$allreplaces);
     
    # Delete unwanted spaces
        $allreplaces=preg_replace("'(^ +| +$)'",'',$allreplaces);
     
    # Delete multiple empty lines
        $allreplaces=preg_replace("'\n+'","\n",$allreplaces);
     
    # Split codepage sections
        $codepages=preg_split("'\n@'",$allreplaces,-1, PREG_SPLIT_NO_EMPTY);
        $cpreg=preg_quote($conf['pdfcp']);
     
    # Find the used codepage
        foreach ($codepages as $codepage) {
          if (preg_match("'" . $cpreg . "'si",$codepage)) {
            $replaces=preg_replace("'" . $cpreg . "\n'si",'',$codepage);
            break;
          }
        }
     
    # Split patterns
        $patterns=preg_split("'\n'",$replaces,-1, PREG_SPLIT_NO_EMPTY);
        foreach ($patterns as $onepair) {
    # Split pairs
          $pairarray=preg_split("' '",$onepair);
    # Make changes
          $text=str_replace($pairarray[0],$pairarray[1],$text);
        }
      }
     
      $text = preg_replace("'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/lib/exe/fetch.php?media=\\3\">", $text); # for uploaded images
      $text = preg_replace("'<img src=\"/(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1>", $text); # for built-in images, smileys for example
      $text = str_replace("href=\"/doku.php?id=", "href=\"#", $text); //correct internal links  
     
      $textarr = preg_split("/\n/",$text);
     
    # Find and change linked images
      $linkeds = preg_grep("'<a href=.*<img src=.* /></a>'i",$textarr);
      foreach ( $linkeds as $linked ) {
        $picture = preg_replace("/<a href=.*\">/i",'',$linked);
        $picture = preg_replace("'</a>'i",'',$picture);
        $found = "'".preg_quote($linked)."'";
        $text = preg_replace($found,$picture,$text);
      }
    # HTML compatibility -> htmldoc can use <br> instead of <br/>
      $text = str_replace('/>','>',$text);
      $text = str_replace('<table', '<table border="1" ', $text);
     
    #write the string to temporary html-file
      $fp = fopen ($filenameInput, "w") or die ("can't create file");
      fwrite($fp,$text);
      fclose($fp);
     
    #Use embedded fonts if needed
      if ($conf['embedfonts']) {
        $fontparam='--embedfonts';
      } else {
        $fontparam='';
      }
     
    #JPEG compression rate settings
      $jpeg=" --jpeg=" . $conf['jpgrate'];
     
    #PDF compatibility
      $pdf="-t " . $conf['pdfversion'];
     
    #Documentwidth
      $width=" --browserwidth " . $conf['browserwidth'];
     
    #convert using htmldoc
      $command = $conf['htmldocdir'] . "htmldoc " . $pdf . $width . $jpeg . " --webpage --charset " . $pdfcp . " --no-title " . $fontparam . " --toctitle \"" . $toctitle . "\" -f " . $filenameOutput . " " . $filenameInput;
      system($command);
      system("exit(0)");
     
    #send to browser
      $filenameOutput=trim($filenameOutput);
      header("Content-type: application/pdf");
      header("Content-Disposition: attachment; filename=dokuwikiexport_" . str_replace(':','_',$_GET["id"]) . ".pdf");
      $fd = @fopen($filenameOutput,"r");
      while(!feof($fd)){
        echo fread($fd,2048);
      }
      fclose($fd);
     
    #clean up temporary files
      system("rm " . $filenameInput);
      system("rm " . $filenameOutput);
    }
     
    //search for child pages and render their html
    function pdfmake_children($text)
    {
      //Extract recursion levels
      global $pdfmake_recursion_level;
      global $pdfmake_recursion_current;
      global $pdfmake_links;
     
      $links = array();
     
      $pdfmake_recursion_current += 1;
      //will contain all subpages at the end.
      $innerText = '';	
     
      //find all links on page
      $regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
      preg_match_all($regex_pattern,$text,$matches);
     
      //The matching pairs will be listed in matches[1]. Sort these matches so that sub-namespaces come before their parent namespaces.
      sort($matches[1]);
      for($i=0; $i< count($matches[1]); $i++) {
        //extract the internal dokuwiki id of the subpage. This is needed to perform the rendering
        $link = substr($matches[1][$i], stripos($matches[1][$i],'title=')+7);
     
        //Don't add a page which has already been included
        if(!in_array($link, $pdfmake_links)) {
          // Call the dokuwiki renderer, if the link does not start with http (then it is not an internal link)
          if(substr($link, 0, 4) != 'http') {
          $innerText .= p_wiki_xhtml($link,'',false);
     
          //Add the link to the collection so it can be sanitized later.
          $pdfmake_links[] = $link;
          $links[] = $link;
          }
        }
      }
    	//Recurse into the next level of internal links
      if($pdfmake_recursion_current < $pdfmake_recursion_level) {
        $innerText = pdfmake_children($innerText);
      }
      $text = pdfmake_correctlinks($text, $links);
      $innerText = pdfmake_correctlinks($innerText, $links);
     
      // return all the text to caller.
      return $text.$innerText;
    }
    function pdfmake_correctlinks($text, $links)
    { 
      for($i = 0; $i < count($links); $i++) {
        $link = $links[$i];
        // this $link is the full path to the dokuwiki page. However, in the HTML output, it is only the name after the last ":" which is inserted as id for a heading.
        $text = str_replace($link, substr($link, strrpos($link, ':')+1), $text);
      }
     
      return $text;
    }

Remember that I have only tested this on my own server (on which it works), so expect bugs and/or strange behavior.

Nicklas Overgaard 2009-10-31 16:45 GMT+1

HTMLDOC and OS X

The official HTMLDOC packages for OS X are not free. I did find another package at http://www.bluem.net/downloads/htmldoc_en/

In inc/common.php I had to change

$command = "/usr/bin/htmldoc --no-title -f " . $filenameOutput . " " . $filenameInput;

to

$command = "/usr/local/bin/htmldoc --webpage --outfile " . $filenameOutput . " " . $filenameInput;

David McCallum 2006-06-30 16:09


Installing HTMLDOC

Installing HTMLDOC should be pretty easy, as written above.
But sadly I don't know how to install it on the server (I only have upload permission via an FTP client).
I downloaded htmldoc-1.9.x-r1474 and uploaded the folder, but can't find an installer package.

Does anyone have a simple step-by-step instruction for beginners?
Thank you

If you are using Debian or a Debian-derived operating system, you can use the following command:

# apt-get install htmldoc

Remark of Remark 3 on HTMLDOC Variant Modifications

Improvements:

  1. no need for a writable directory: the system temporary directory is used automatically
  2. unique filenames are used, so several people can export files at the same time without problems

if you replace

  $dir=DOKU_INC."tmp/";
  $filenameInput=$dir."input_";
  $filenameOutput=$dir."output_";

with

  $filenameInput=tempnam('','html');
  $filenameOutput=tempnam('','pdf');

HTMLDOC request

I think it would be very useful if you could create a page listing the wiki pages to export, and HTMLDOC would export all of them into a single PDF file.
For example, if a wiki page starts with a special string (example: HTMLDOC_EXTRACT), then all pages listed below it should be extracted.
Example:

HTMLDOC_EXTRACT

  * [[tips]]
    * [[tips:pdfexport]]
    * [[tips:browserlanguagedetection]]

This way you could create pages from which a PDF file is generated out of several wiki pages.
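
This isn't implemented anywhere above, but here is a minimal PHP sketch of the idea. pdfmake() is the function from the HTMLDOC variant, rawWiki() and p_wiki_xhtml() are existing DokuWiki functions; the marker handling and link parsing are assumptions:

  # Hypothetical: export every page listed on a "collection" page into one PDF.
  function pdfmake_collection($id){
    $raw = rawWiki($id);                       # raw wiki source of the collection page
    if(strpos($raw,'HTMLDOC_EXTRACT') !== 0) return false;

    # collect the page ids from the [[...]] links, one per list item
    preg_match_all('/\[\[([^\]|]+)/', $raw, $m);

    $html = '';
    foreach($m[1] as $page){
      $html .= p_wiki_xhtml(trim($page),'',false);   # render each listed page to XHTML
    }
    pdfmake($html);                            # hand the combined XHTML to the converter
    return true;
  }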

Config problem with HTMLDOC variant

If you have added all the $conf['abc'] lines to the dokuwiki.php file and you get an 'Undefined Settings' section on your configuration page, like

$conf['browserwidth'] No setting metadata.
$conf['customtoc'] No setting metadata.

you have to declare all the values in your config.metadata.php.
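
A sketch of matching declarations, assuming the standard setting classes of the configuration plugin ('string', 'numeric', 'onoff'):

  $meta['customtoc']        = array('string');
  $meta['uselangtoc']       = array('onoff');
  $meta['embedfonts']       = array('onoff');
  $meta['jpgrate']          = array('numeric');
  $meta['usecustomreplace'] = array('onoff');
  $meta['pdfcp']            = array('string');
  $meta['htmldocdir']       = array('string');
  $meta['pdfversion']       = array('string');
  $meta['browserwidth']     = array('numeric');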

1)
On Debian these are included in the packages html2ps and gs-common; all other needed packages (gs, for example) are installed automatically by apt-get.