tips:extract_links

The following snippet extracts all external links from the wiki's page files, each together with its surrounding context. The context runs from the nearest preceding start of line, link, or start of sentence up to the nearest following end of line, link, or end of sentence.

<?php
 
// where your dokuwiki data resides
$dir = '/var/lib/dokuwiki/data/pages/';
 
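// quote the pattern so the shell does not expand it; array_filter() drops the empty entry left by the trailing newline
$pages = array_filter(explode("\n", (string) `find "$dir" -iname '*.txt'`));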
 
foreach ($pages as $page)
{
  $contents = file_get_contents($page);
 
  // bracketed links: [[http://www.bla...|link text]]
  $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
  // free-standing links: www.google.com (no http:// necessary)
  $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';
  // punctuation marks to find the end or beginning of a sentence
  $punctuation = '\.|\!|\?';
  // get the rest of the sentence, only if there is not another link in the same sentence
  $suffix = '(?=.*?((www|ftp)\.|\[\[https?\:\/\/).*?($|'.$punctuation.'))|(.*?(?:$|'.$punctuation.'))';
 
  $regex = '/(?:^|'.$punctuation.')?(?P<prefix>.*?)(?:'.$regexBracketed.'|'.$regexFree.')(?P<suffix>'.$suffix.')/m';
 
  preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
 
  foreach ($matches as $match)
  {
    // $match[0] contains the entire match as described above
 
    // see which kind of link we discovered
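    // with PREG_SET_ORDER each named group is a string; an empty 'bracketed' group means the free-standing pattern matched
    $hit = (!empty($match['bracketed']) ? $match['bracketed'] : $match['free']);
 
    // split into link|linktext if necessary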
    if (strpos($hit, '|') === false)
    {
      $url = $hit;
      $title = $hit;
    }
    else
    {
      list($url, $title) = explode('|', $hit, 2);
    }
 
    // $url and $title are now set for this link; process them here (for example, write them to a report)
  }
}
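
To try the link patterns and the link|linktext split without touching the file system, they can be wrapped in a small function that works on a single string. This is only a minimal sketch: the function name `extract_links` and the sample text are made up for illustration, and the prefix/suffix context groups of the full regex are left out on purpose.

```php
<?php

// Hypothetical helper (not part of the snippet above): returns the external
// links found in one page's text as [url, title] pairs, without context.
function extract_links(string $contents): array
{
    // the same two link patterns as in the snippet above
    $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
    $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';

    preg_match_all('/' . $regexBracketed . '|' . $regexFree . '/', $contents, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $match) {
        // an empty 'bracketed' group means the free-standing pattern matched
        $hit = !empty($match['bracketed']) ? $match['bracketed'] : $match['free'];

        if (strpos($hit, '|') === false) {
            // no link text given: use the URL as its own title
            $links[] = [$hit, $hit];
        } else {
            // split only at the first '|', so link text may contain further pipes
            $links[] = explode('|', $hit, 2);
        }
    }

    return $links;
}

// made-up sample text with one bracketed and one free-standing link
$sample = 'See [[http://example.com/a|Example A]] and www.example.org for details.';

foreach (extract_links($sample) as [$url, $title]) {
    echo $url, ' => ', $title, "\n";
}
// prints:
// http://example.com/a => Example A
// www.example.org => www.example.org
```

Note that the free-standing pattern takes everything up to the next whitespace, so a link directly followed by a full stop or comma will include that character.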

For further questions you can contact me at g [dot] sorst [at] clickforknowledge [dot] com.

tips/extract_links.txt · Last modified: 2009-11-03 23:55 by 85.127.232.158