Using regex to catch dates in multiple formats

I'm working on an app which scrapes local websites to create a database of upcoming events, and I'm trying to use Regex to catch as many formats of dates as possible.

Consider the following sentence fragments:

  • "The focus of the seminar, on Saturday 2nd February 2013 will be [...]"
  • "Valentines Special @ The Radisson, Feb 14th"
  • "On Friday the 15th of February, a special Hollywood themed [...]"
  • "Symposium on Childhood Play on Friday, February 8th"
  • "Hosting a craft workshop March 9th - 11th in the old [...]"

I want to be able to scan these and catch as many dates as possible. At the moment I'm doing this in what is probably a flawed way (I'm not great at regex), going through several regex statements one after the other, like this

/([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
/([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
/(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
/(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i

I could merge these all into one giant regex statement, but it seems like there must be a cleaner way of doing this in php, maybe a third-party library or something?

EDIT: The regex above may have errors - it's only meant as an example.
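
As a rough sketch of the "one merged expression" idea, the four patterns above can be folded into a single pattern built from smaller pieces. The function name find_dates and the exact sub-patterns are assumptions made for illustration (English month names only, optional four-digit year); this is not a complete date grammar.

```php
<?php
// Hypothetical helper: one pattern covering both "15th of February" and "Feb 14th" orders.
function find_dates(string $text): array
{
    // Three-letter month prefix plus optional remainder (assumption: English only).
    $month = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    // Day of month 1-31 with an optional ordinal suffix.
    $day = '(?:[12][0-9]|3[01]|0?[1-9])(?:st|nd|rd|th)?';
    // Optional trailing year such as " 2013" or ", 2013".
    $year = '(?:,? (?:19|20)[0-9]{2}\b)?';

    // Either "day [of] month" or "month day", followed by the optional year.
    $pattern = "/\\b(?:$day(?: of)? $month|$month $day)\\b$year/i";
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$fragments = array(
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
);
foreach ($fragments as $fragment) {
    echo implode(', ', find_dates($fragment)), "\n";
}
```

Building the pattern from named pieces keeps each part readable and lets the month list be extended in one place. Each matched string can then be handed to strtotime() or a similar parser.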

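doushuo8677
doushuo8677 Just did it, thanks!
more than 7 years ago · Reply
ds3464
ds3464 I'd like to know A) whether the four-plus expressions I need can be neatly merged into a single expression, or B) whether there is a library or function that can do this, as suggested below
more than 7 years ago · Reply
dongyou6847
dongyou6847 These functions may be useful: DateTime::createFromFormat, date_parse, strftime, strtotime
more than 7 years ago · Reply
dongwuli5105
dongwuli5105 What exactly is the question? Are you looking for a single equivalent of the four regexes you already have? Is there any reason you can't use several?
more than 7 years ago · Reply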
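
Of the functions suggested in the comments, date_parse() is handy for checking a candidate string because it reports fields it could not find as false rather than filling them in. A minimal sketch, assuming a hypothetical helper named describe_date:

```php
<?php
// Hypothetical helper: returns year/month/day for a candidate string,
// or null when the string does not contain at least a month and a day.
function describe_date(string $candidate): ?array
{
    $parts = date_parse($candidate);
    if ($parts['error_count'] > 0 || $parts['month'] === false || $parts['day'] === false) {
        return null;
    }
    return array(
        // date_parse() leaves 'year' as false when the text has no year.
        'year'  => $parts['year'] === false ? null : $parts['year'],
        'month' => $parts['month'],
        'day'   => $parts['day'],
    );
}

var_dump(describe_date('2nd February 2013'));
var_dump(describe_date('Feb 14th'));
```

Knowing that the year is missing lets the caller decide which year to assume, instead of silently getting the current one.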

2 answers



I wrote a function which extracts dates out of text by using strtotime():

function parse_date_tokens($tokens) {
  # only try to extract a date if we have 2 or more tokens
  if(!is_array($tokens) || count($tokens) < 2) return false;
  return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
  static $patterns = Array(
    '/^[0-9]+(st|nd|rd|th|)?$/i', # day
    '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
    '/^20[0-9]{2}$/', # year
    '/^of$/' #words
  );
  # defines which of the above patterns aren't actually part of a date
  static $drop_patterns = Array(
    false,
    false,
    false,
    true
  );
  $tokens = Array();
  $result = Array();
  $text = str_word_count($text, 1, '0123456789'); # get all words in text

  # iterate words and search for matching patterns
  foreach($text as $word) {
    $found = false;
    foreach($patterns as $key => $pattern) {
      if(preg_match($pattern, $word)) {
        if(!$drop_patterns[$key]) {
          $tokens[] = $word;
        }
        $found = true;
        break;
      }
    }

    if(!$found) {
      $result[] = parse_date_tokens($tokens);
      $tokens = Array();
    }
  }
  $result[] = parse_date_tokens($tokens);

  return array_filter($result);
}

# test
$texts = Array(
  "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
  "Valentines Special @ The Radisson, Feb 14th",
  "On Friday the 15th of February, a special Hollywood themed [...]",
  "Symposium on Childhood Play on Friday, February 8th",
  "Hosting a craft workshop March 9th - 11th in the old [...]"
);

$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
  echo "  " . date('d.m.Y H:i:s', $date) . "\n";
}

This outputs (when run in 2013, since strtotime() fills in the current year for dates that have none):

Dates: 
  02.02.2013 00:00:00
  14.02.2013 00:00:00
  15.02.2013 00:00:00
  08.02.2013 00:00:00
  09.03.2013 00:00:00

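This solution is not perfect and certainly has its flaws (for example, it only picks up the first day of the range "March 9th - 11th", and the month pattern still needs the remaining months filled in), but it is a fairly simple approach to your problem.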
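
Because dates without a year resolve against the current time, a scraper may prefer to resolve them against a known reference date instead. strtotime() accepts a base timestamp as its second argument for this. The sketch below is a hypothetical variant of parse_date_tokens(); the name parse_date_tokens_at, the fixed UTC timezone and the sample publication date are all assumptions for illustration.

```php
<?php
date_default_timezone_set('UTC'); // assumption: fixed zone so results are reproducible

// Hypothetical variant of parse_date_tokens() that takes a reference time.
function parse_date_tokens_at(array $tokens, int $base)
{
    // As in the original, ignore runs of fewer than 2 tokens.
    if (count($tokens) < 2) return false;
    // Fields missing from the text (here: the year) are taken from $base.
    return strtotime(implode(' ', $tokens), $base);
}

// Assumed publication date of the scraped page: 20 January 2013.
$published = mktime(0, 0, 0, 1, 20, 2013);
$timestamp = parse_date_tokens_at(array('Feb', '14th'), $published);
echo date('d.m.Y H:i:s', $timestamp), "\n";
```

Passing the page's own date keeps old pages from being read as events in the current year.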

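dsiv4041
dsiv4041 Thanks, I hadn't noticed that. I've fixed it.
more than 7 years ago · Reply
dtt3399
dtt3399 So it misses "15th of February" because of the 'of'?
more than 7 years ago · Reply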

For this kind of potentially complex regex, I tend to break it down into simple pieces that can be individually unit-tested, maintained and evolved.

I use REL, a DSL (in Scala) that allows you to reassemble and reuse your regex pieces. This way, you can define your regex like these Date matchers and unit-test each part.

Also, your unit/spec tests can double as your doc for this bit of regex, indicating what is matched and what is not (which tends to be important with regexes).

In the upcoming version of REL (0.3), you will be able to export the regex directly in, say, PCRE (and thus PHP) flavor and use it independently. For now, only JavaScript and .NET translations are implemented in the GitHub repository. Using the latest (not yet publicly committed) snapshot, the PCRE flavor of the English alphanumeric date regex is this:

/(?:(?:(?<!\d)(?<a_d1>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?: ?+(?:of )?+))(?>(?<a_m1>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?))|(?:\b(?>(?<a_m2>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?)))(?:(?:(?: ?+)(?<a_d2>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?!\d))?))(?:(?:,?+)(?:(?:(?: ?)(?<a_y>(?:1[7-9]|20)\d\d|'?+\d\d))(?!\d))|(?<=\b|\.))/i

This was obtained by expressing fr.splayce.rel.matchers.en.Date.ALPHA with PCREFlavor (not yet in the GitHub repository). It only matches when the month is written in alphabetic form (feb, feb. or february); the ….Date.ALL regex, which also matches numerical forms like 2/21/2013, is more complex.

Also, this particular regex matches your examples but may still be a bit limited for your needs:

  • It does not include the weekdays
  • It will not match date ranges (only March 9th is matched)
  • It will not match dates with the year first, e.g. 2013, jan. 14th
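
The date-range limitation could be handled by a separate, much smaller pattern run alongside the main one. The following is only a sketch, not REL output: the function name find_day_ranges, the group names month, from and to, and the "month day - day" shape are assumptions for illustration.

```php
<?php
// Hypothetical helper: finds "Month 9th - 11th" style ranges (and single days).
function find_day_ranges(string $text): array
{
    $month = '(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    $day = '[0-9]{1,2}(?:st|nd|rd|th)?';
    // Month, first day, then an optional "- last day" part.
    $pattern = "/\\b$month (?<from>$day)(?: ?- ?(?<to>$day))?\\b/i";

    preg_match_all($pattern, $text, $hits, PREG_SET_ORDER);
    $ranges = array();
    foreach ($hits as $hit) {
        $from = (int) $hit['from']; // the cast drops the ordinal suffix
        // Without a range part, treat the date as a one-day range.
        $to = (isset($hit['to']) && $hit['to'] !== '') ? (int) $hit['to'] : $from;
        $ranges[] = array($hit['month'], $from, $to);
    }
    return $ranges;
}

print_r(find_day_ranges("Hosting a craft workshop March 9th - 11th in the old [...]"));
```

Each result holds the month text, the first day and the last day, so the caller can expand the range into individual dates if needed.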