tips:extract_links

The following snippet extracts all external links from the wiki's pages, together with their surrounding context. The context runs from the nearest preceding boundary (beginning of the line, previous link, or beginning of the sentence) to the nearest following boundary (end of the line, next link, or end of the sentence).

<?php
 
// where your dokuwiki data resides
$dir = '/var/lib/dokuwiki/data/pages/';
 
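// quote the pattern so the shell does not expand it, and drop empty entries
$find = 'find ' . escapeshellarg($dir) . " -iname '*.txt'";
$pages = array_filter(explode("\n", (string) shell_exec($find)));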
 
foreach ($pages as $page)
{
  $contents = file_get_contents($page);
 
  //bracketed links: [[http://www.bla...|link text]]
  $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
  // free-standing links: www.google.com (no http:// necessary)
  $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';
  // punctuation marks to find the end or beginning of a sentence
  $punctuation = '\.|\!|\?';
  // get the rest of the sentence, only if there is not another link in the same sentence
  $suffix = '(?=.*?((www|ftp)\.|\[\[https?\:\/\/).*?($|'.$punctuation.'))|(.*?(?:$|'.$punctuation.'))';
 
  $regex = '/(?:^|'.$punctuation.')?(?P<prefix>.*?)(?:'.$regexBracketed.'|'.$regexFree.')(?P<suffix>'.$suffix.')/m';
 
  preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
 
  foreach ($matches as $match)
  {
    // $match[0] contains the entire match as described above
 
    // see which kind of link we discovered
    // with PREG_SET_ORDER each named group is a plain string, so no [0] index is needed
    $hit = !empty($match['bracketed']) ? $match['bracketed'] : $match['free'];
 
    // split into link|linktext if necessary
    if (strpos($hit, '|') === false)
    {
      $url = $hit;
      $title = $hit;
    }
    else
    {
      list($url, $title) = explode('|', $hit, 2);
    }
 
    // $url and $title are now ready for further processing, e.g. printing or storing
  }
}
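
The snippet computes $url and $title for each hit but leaves their use to the reader. As a minimal sketch, the same regular expressions could be wrapped in a function that returns the results. The function name extractLinks, the array keys and the sample text are illustrative assumptions and not part of the original tip.

```php
<?php

// Hypothetical wrapper (not part of the original snippet): returns one
// array per external link with its URL, title and surrounding context.
function extractLinks(string $contents): array
{
    // Same building blocks as in the snippet above
    $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
    $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';
    $punctuation = '\.|\!|\?';
    $suffix = '(?=.*?((www|ftp)\.|\[\[https?\:\/\/).*?($|'.$punctuation.'))|(.*?(?:$|'.$punctuation.'))';
    $regex = '/(?:^|'.$punctuation.')?(?P<prefix>.*?)(?:'.$regexBracketed.'|'.$regexFree.')(?P<suffix>'.$suffix.')/m';

    $links = [];
    preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);

    foreach ($matches as $match) {
        // Pick whichever kind of link matched
        $hit = !empty($match['bracketed']) ? $match['bracketed'] : $match['free'];

        // explode() yields a single element when there is no '|';
        // in that case the URL doubles as the title
        $parts = explode('|', $hit, 2);

        $links[] = [
            'url'     => $parts[0],
            'title'   => $parts[1] ?? $parts[0],
            'context' => $match[0],
        ];
    }

    return $links;
}

// Made-up sample text with one bracketed and one free-standing link
$sample = "See [[http://example.com/page|Example page]] for details.\n"
        . "Visit www.example.org today";

foreach (extractLinks($sample) as $link) {
    echo $link['url'], ' | ', $link['title'], ' | ', trim($link['context']), "\n";
}
```

For the sample text, the bracketed link is split at the | into URL and title, while the free-standing link has no separate title, so its URL is reused. To process a whole wiki, the function could be called on the contents of each page file in the loop above.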

For further questions you can contact me at g [dot] sorst [at] clickforknowledge [dot] com.

tips/extract_links.txt · Last modified: 2009-11-03 23:55 by 85.127.232.158

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International