ds2321
2013-01-29 13:54 · 127 views
Accepted

Using Regex to capture dates in multiple formats

I'm working on an app which scrapes local websites to create a database of upcoming events, and I'm trying to use Regex to catch as many formats of dates as possible.

Consider the following sentence fragments:

  • "The focus of the seminar, on Saturday 2nd February 2013 will be [...]"
  • "Valentines Special @ The Radisson, Feb 14th"
  • "On Friday the 15th of February, a special Hollywood themed [...]"
  • "Symposium on Childhood Play on Friday, February 8th"
  • "Hosting a craft workshop March 9th - 11th in the old [...]"

I want to scan these and catch as many dates as possible. At the moment I'm doing this in what is probably a flawed way (I'm not great at regex): running several regex statements one after the other, like this:

/([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
/([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
/(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
/(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i

I could merge these all into one giant regex statement, but it seems like there must be a cleaner way of doing this in PHP, maybe a third-party library or something?

EDIT: The regex above may have errors - it's only meant as an example.

2 answers

  • Accepted
    douchu4048 douchu4048 2013-01-29 14:59

    I wrote a function which extracts dates out of text by using strtotime():

    function parse_date_tokens($tokens) {
      # only try to extract a date if we have 2 or more tokens
      if(!is_array($tokens) || count($tokens) < 2) return false;
      return strtotime(implode(" ", $tokens));
    }
    
    function extract_dates($text) {
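      static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th)?$/i', # day, with optional ordinal suffix
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|June?|July?|Aug(ust)?|Sep(t(ember)?)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)$/i', # month
        '/^20[0-9]{2}$/', # year
        '/^of$/i' # filler words
      );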
      # defines which of the above patterns aren't actually part of a date
      static $drop_patterns = Array(
        false,
        false,
        false,
        true
      );
      $tokens = Array();
      $result = Array();
      $text = str_word_count($text, 1, '0123456789'); # get all words in text
    
      # iterate words and search for matching patterns
      foreach($text as $word) {
        $found = false;
        foreach($patterns as $key => $pattern) {
          if(preg_match($pattern, $word)) {
            if(!$drop_patterns[$key]) {
              $tokens[] = $word;
            }
            $found = true;
            break;
          }
        }
    
        if(!$found) {
          $result[] = parse_date_tokens($tokens);
          $tokens = Array();
        }
      }
      $result[] = parse_date_tokens($tokens);
    
      return array_filter($result);
    }
    
    # test
    $texts = Array(
      "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
      "Valentines Special @ The Radisson, Feb 14th",
      "On Friday the 15th of February, a special Hollywood themed [...]",
      "Symposium on Childhood Play on Friday, February 8th",
      "Hosting a craft workshop March 9th - 11th in the old [...]"
    );
    
    $dates = extract_dates(implode(" ", $texts));
    echo "Dates: \n";
    foreach($dates as $date) {
      echo "  " . date('d.m.Y H:i:s', $date) . "\n";
    }
    

    This outputs:

    Dates: 
      02.02.2013 00:00:00
      14.02.2013 00:00:00
      15.02.2013 00:00:00
      08.02.2013 00:00:00
      09.03.2013 00:00:00
    

    This solution may not be perfect and certainly has its flaws, but it is a fairly simple solution to your problem.
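
    Because strtotime() takes any missing field (here, the year) from the current time, the output above depends on when the script runs. A minimal sketch of one way to pin that down, assuming you know roughly when the scraped page was published; `$published` is a made-up value for illustration, passed as the optional second argument of strtotime():

```php
<?php
date_default_timezone_set('UTC');

// Hypothetical publication time of the scraped page (an assumption, not from the question).
$published = mktime(12, 0, 0, 1, 29, 2013);

// No year in the text: the year is taken from $published instead of "now".
$ts = strtotime('Feb 14th', $published);
echo date('d.m.Y', $ts), "\n";

// An explicit year in the text still takes precedence over the base timestamp.
$ts = strtotime('2nd February 2014', $published);
echo date('d.m.Y', $ts), "\n";
```

    In parse_date_tokens() this would mean passing the base timestamp through as an extra parameter.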

    4 likes
  • dqhdz04240 dqhdz04240 2013-01-29 14:39

    For potentially complex regexes like this one, I tend to break them down into simple pieces that can be individually unit-tested, maintained and evolved.

    I use REL, a DSL (in Scala) that lets you reassemble and reuse your regex pieces. This way, you can define your regex like these Date matchers and unit-test each part.

    Also, your unit/spec tests can double as your doc for this bit of regex, indicating what is matched and what is not (which tends to be important with regexes).
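
    The same idea can be sketched in plain PHP without REL. This is a simplified illustration, not the REL-generated regex: the helper names (day_piece, month_piece, alpha_date_regex, find_alpha_dates) are hypothetical, and the pieces only cover "2nd February", "15th of February" and "Feb 14th" style dates.

```php
<?php
// Day of month (1-31) with an optional ordinal suffix, captured under $name.
function day_piece(string $name): string {
    return '(?<' . $name . '>[12][0-9]|3[01]|0?[1-9])(?:st|nd|rd|th)?';
}

// English month name, short or long form, captured under $name.
function month_piece(string $name): string {
    return '(?<' . $name . '>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
         . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
}

// Assemble the pieces into one pattern: day-first or month-first.
// Group names differ per branch because PCRE rejects duplicate names by default.
function alpha_date_regex(): string {
    $dayFirst   = '\b' . day_piece('d1') . '\s+(?:of\s+)?' . month_piece('m1') . '\b';
    $monthFirst = '\b' . month_piece('m2') . '\s+' . day_piece('d2') . '\b';
    return '/(?:' . $dayFirst . '|' . $monthFirst . ')/i';
}

// Return every match as ['day' => int, 'month' => three-letter lowercase name].
function find_alpha_dates(string $text): array {
    preg_match_all(alpha_date_regex(), $text, $matches, PREG_SET_ORDER);
    $found = [];
    foreach ($matches as $m) {
        $dayFirst = ($m['d1'] ?? '') !== '';
        $found[] = [
            'day'   => (int) ($dayFirst ? $m['d1'] : $m['d2']),
            'month' => strtolower(substr($dayFirst ? $m['m1'] : $m['m2'], 0, 3)),
        ];
    }
    return $found;
}

print_r(find_alpha_dates('On Friday the 15th of February, a special Hollywood themed [...]'));
```

    Each piece can then be tested on its own, for example by wrapping day_piece('d') in ^...$ and checking that it accepts "22nd" but not "42". Note that the month piece also matches the ordinary word "may" when a number follows it.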

    In the upcoming version of REL (0.3), you will be able to export the regex directly in, say, PCRE (thus PHP) flavor and use it independently. For now, only JavaScript and .NET translations are implemented in the GitHub repository. Using the latest (not yet publicly committed) snapshot, the PCRE flavor of the English alphanumeric date regex is this:

    /(?:(?:(?<!\d)(?<a_d1>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?: ?+(?:of )?+))(?>(?<a_m1>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?))|(?:\b(?>(?<a_m2>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?)))(?:(?:(?: ?+)(?<a_d2>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?!\d))?))(?:(?:,?+)(?:(?:(?: ?)(?<a_y>(?:1[7-9]|20)\d\d|'?+\d\d))(?!\d))|(?<=\b|\.))/i
    

    Obtained by expressing fr.splayce.rel.matchers.en.Date.ALPHA using PCREFlavor (not yet in the GitHub repository). It only matches when the month is written in alphabetic form (feb, feb. or february). The ….Date.ALL regex, which also matches numerical forms like 2/21/2013, is more complex.

    Also, this particular regex matches your examples but may still be a bit limited for your needs:

    • It does not include the weekdays
    • It will not match date ranges (in your last example it matches only March 9th)
    • It will not match dates with the year first, e.g. 2013, jan. 14th
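
    For the date-range limitation, one possible workaround is a separate, much smaller pattern that runs alongside the main regex. The following is only a sketch under that assumption: expand_day_range is a hypothetical helper, not part of REL, and it only handles ranges within a single month written as "Month Nth - Mth".

```php
<?php
// Hypothetical helper: expand "March 9th - 11th" into one string per day.
function expand_day_range(string $text): array {
    $month = '(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    $ord   = '(?:st|nd|rd|th)?';
    $re    = '/\b' . $month . '\s+(?<from>\d{1,2})' . $ord
           . '\s*-\s*(?<to>\d{1,2})' . $ord . '\b/i';

    // No range found, or a reversed range: return nothing rather than guess.
    if (!preg_match($re, $text, $m) || (int) $m['from'] > (int) $m['to']) {
        return [];
    }

    $days = [];
    foreach (range((int) $m['from'], (int) $m['to']) as $day) {
        $days[] = $m['month'] . ' ' . $day; // e.g. "March 9", usable with strtotime()
    }
    return $days;
}

print_r(expand_day_range('Hosting a craft workshop March 9th - 11th in the old [...]'));
```

    The returned strings keep the month exactly as written in the text, so they can be fed to strtotime() in the same way as the single dates.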
    6 likes
