ds2321 2013-01-29 13:54

Using regex to capture dates in multiple formats

I'm working on an app which scrapes local websites to create a database of upcoming events, and I'm trying to use Regex to catch as many formats of dates as possible.

Consider the following sentence fragments:

  • "The focus of the seminar, on Saturday 2nd February 2013 will be [...]"
  • "Valentines Special @ The Radisson, Feb 14th"
  • "On Friday the 15th of February, a special Hollywood themed [...]"
  • "Symposium on Childhood Play on Friday, February 8th"
  • "Hosting a craft workshop March 9th - 11th in the old [...]"

I want to be able to scan these and catch as many dates as possible. At the moment I'm doing this in what is probably a flawed way (I'm not great at regex): I run several regex statements one after the other, like this:

/([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
/([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
/(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
/(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i

I could merge these all into one giant regex statement, but it seems like there must be a cleaner way of doing this in PHP, maybe a third-party library or something?

EDIT: The regex above may have errors - it's only meant as an example.


  • douchu4048 2013-01-29 14:59

    I wrote a function which extracts dates out of text by using strtotime():

    function parse_date_tokens($tokens) {
      # only try to extract a date if we have 2 or more tokens
      if(!is_array($tokens) || count($tokens) < 2) return false;
      return strtotime(implode(" ", $tokens));
    }
    
    function extract_dates($text) {
      static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th)?$/i', # day, with an optional ordinal suffix
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|June?|July?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)$/i', # month
        '/^20[0-9]{2}$/', # year
        '/^of$/' #words
      );
      # defines which of the above patterns aren't actually part of a date
      static $drop_patterns = Array(
        false,
        false,
        false,
        true
      );
      $tokens = Array();
      $result = Array();
      $text = str_word_count($text, 1, '0123456789'); # get all words in text
    
      # iterate words and search for matching patterns
      foreach($text as $word) {
        $found = false;
        foreach($patterns as $key => $pattern) {
          if(preg_match($pattern, $word)) {
            if(!$drop_patterns[$key]) {
              $tokens[] = $word;
            }
            $found = true;
            break;
          }
        }
    
        if(!$found) {
          $result[] = parse_date_tokens($tokens);
          $tokens = Array();
        }
      }
      $result[] = parse_date_tokens($tokens);
    
      return array_filter($result);
    }
    
    # test
    $texts = Array(
      "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
      "Valentines Special @ The Radisson, Feb 14th",
      "On Friday the 15th of February, a special Hollywood themed [...]",
      "Symposium on Childhood Play on Friday, February 8th",
      "Hosting a craft workshop March 9th - 11th in the old [...]"
    );
    
    $dates = extract_dates(implode(" ", $texts));
    echo "Dates: \n";
    foreach($dates as $date) {
      echo "  " . date('d.m.Y H:i:s', $date) . "\n";
    }
    

    This outputs (when run in 2013):

    Dates: 
      02.02.2013 00:00:00
      14.02.2013 00:00:00
      15.02.2013 00:00:00
      08.02.2013 00:00:00
      09.03.2013 00:00:00
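
    Only the first string above names a year; for the others, strtotime() falls back to the current year. If that matters for your database, a minimal sketch along these lines could tell the two cases apart. It relies on the built-in date_parse(), and the helper name has_explicit_year() is hypothetical, not part of the function above.

```php
<?php
// Hypothetical helper (an assumption for illustration): reports whether a
// date string names its own year, using PHP's built-in date_parse().
function has_explicit_year(string $text): bool {
    $parts = date_parse($text);
    // date_parse() leaves 'year' as false when the string contains no year.
    return $parts['error_count'] === 0 && $parts['year'] !== false;
}

var_dump(has_explicit_year("2nd February 2013"));
var_dump(has_explicit_year("Feb 14th"));
```

    You could call this on the imploded tokens before strtotime() and, for example, store a flag next to the timestamp so that guessed years can be reviewed later.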
    

    This solution is not perfect and certainly has its flaws, but it is a fairly simple answer to your problem.
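
    For comparison, the "one giant regex" idea from the question could look roughly like the sketch below. It is only an illustration under my own assumptions: find_date_strings() and the variable names are made up, only four-digit years starting with 19 or 20 are accepted, and it returns the matched text rather than timestamps.

```php
<?php
// Sketch only: find_date_strings() is an illustrative name, not a library function.
function find_date_strings(string $text): array {
    // Full or abbreviated English month names.
    $month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|'
           . 'Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)';
    // One or two digits with an optional ordinal suffix.
    $day   = '[0-9]{1,2}(?:st|nd|rd|th)?';
    // Optional four-digit year (assumed to start with 19 or 20).
    $year  = '(?:\s+(?:19|20)[0-9]{2})?';

    // Either "2nd [of] February [2013]" or "February 2nd [2013]".
    $pattern = "/\\b(?:{$day}\\s+(?:of\\s+)?{$month}|{$month}\\s+{$day}){$year}\\b/i";

    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // the full matched strings, in order of appearance
}

print_r(find_date_strings("On Friday the 15th of February, a special Hollywood themed"));
// prints an array with the single element "15th of February"
```

    The matches still have to be turned into timestamps; stripping the word "of" before handing them to strtotime(), as the function above does, keeps the two approaches consistent. Like the token approach, this only catches the first day of a range such as "March 9th - 11th".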
