
The following snippet extracts all external links from the wiki together with their surrounding context. The context runs from the preceding boundary (beginning of the line, previous link, or beginning of the sentence) to the following boundary (end of the line, next link, or end of the sentence).
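To make the idea of "link plus context" concrete, here is a deliberately simplified sketch. It is not the regular expression used by the snippet on this page: it only handles bracketed links and treats the whole line as the context, ignoring sentence and neighbouring-link boundaries. The function name `linksWithLineContext` is made up for this illustration.

```php
<?php
// Hypothetical helper for illustration only: returns every bracketed
// external link together with the line it appears on.
function linksWithLineContext(string $text): array
{
    $results = [];
    // \R matches any kind of line break
    foreach (preg_split('/\R/', $text) as $line) {
        // [[url]] or [[url|title]]; group 1 is the URL, group 2 the optional title
        $pattern = '/\[\[(https?:\/\/[^\]|]+)(?:\|([^\]]*))?\]\]/';
        if (!preg_match_all($pattern, $line, $hits, PREG_SET_ORDER)) {
            continue;
        }
        foreach ($hits as $hit) {
            $url = $hit[1];
            // fall back to the URL when no title was given
            $title = (isset($hit[2]) && $hit[2] !== '') ? $hit[2] : $url;
            $results[] = [
                'url'     => $url,
                'title'   => $title,
                'context' => trim($line),
            ];
        }
    }
    return $results;
}

$sample = "Intro line without links.\n"
        . "Read [[http://www.example.com|Example]] first.\n"
        . "Then [[https://example.org/docs]] too.";
print_r(linksWithLineContext($sample));
```

Using the line as context keeps the sketch short; the full snippet narrows the context further to the sentence or to the span between two links.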

<?php
 
// where your dokuwiki data resides
$dir = '/var/lib/dokuwiki/data/pages/';
 
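// list all page files; the pattern is quoted so the shell does not expand it,
// and array_filter drops the empty entry left by the trailing newline
$pages = array_filter(explode("\n", (string) shell_exec('find ' . escapeshellarg($dir) . ' -iname "*.txt"')));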
 
foreach ($pages as $page)
{
  $contents = file_get_contents($page);
 
  //bracketed links: [[http://www.bla...|link text]]
  $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
  // free-standing links: www.google.com (no http:// necessary)
  $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';
  // punctuation marks to find the end or beginning of a sentence
  $punctuation = '\.|\!|\?';
  // get the rest of the sentence, only if there is not another link in the same sentence
  $suffix = '(?=.*?((www|ftp)\.|\[\[https?\:\/\/).*?($|'.$punctuation.'))|(.*?(?:$|'.$punctuation.'))';
 
  $regex = '/(?:^|'.$punctuation.')?(?P<prefix>.*?)(?:'.$regexBracketed.'|'.$regexFree.')(?P<suffix>'.$suffix.')/m';
 
  preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
 
  foreach ($matches as $match)
  {
    // $match[0] contains the entire match as described above
 
    // see which kind of link we discovered
    // named groups that did not match are empty strings, so test the string itself
    $hit = (!empty($match['bracketed']) ? $match['bracketed'] : $match['free']);
 
    // split into link|linktext if necessary
    if (strpos($hit, '|') === false)
    {
      $url = $hit;
      $title = $hit;
    }
    else
    {
      list($url, $title) = explode('|', $hit, 2);
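    }

    // $url, $title and the context in $match[0] are now available;
    // as an example, print one tab-separated line per link
    echo $url . "\t" . $title . "\n";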
  }
}
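The script above collects the page files by calling the `find` command, which assumes a Unix-like shell. As a possible alternative, the following sketch gathers the same list with PHP's SPL directory iterators. The function name `listPageFiles` is an assumption made for this example and is not part of DokuWiki or of the original snippet.

```php
<?php
// Hypothetical helper: collect all *.txt files below $dir without
// shelling out to find.
function listPageFiles(string $dir): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // compare the extension case-insensitively, like find -iname
        if ($file->isFile() && strtolower($file->getExtension()) === 'txt') {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}
```

With this in place, the `$pages` assignment could become `$pages = listPageFiles($dir);` and the rest of the loop would stay unchanged.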

For further questions you can contact me at g [dot] sorst [at] clickforknowledge [dot] com.

tips/extract_links.txt · Last modified: 2009/11/03 23:55 by 85.127.232.158
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 3.0 Unported