dongtuo5262 2016-04-14 13:58

Using regular expressions to extract data for a database (email subject line)

I'm hoping someone can help me get to the bottom of a problem I am having. I had a script put together about a year ago which parses incoming email and stores details in a database.

I get the email through with headers like so:

-------- Forwarded Message --------
Subject:    FS.G02 Fleet Street - j** associates (AG69)
Date:   Thu, 14 Apr 2016 11:27:32 +0000
From:   Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>
To:     'lucien@********.com' <lucien@********.com>

I use the following regex and PHP code to separate various pieces of data out ($text contains the above email string):

//Set RegEx to parse data out of text/plain email string
$re1 = '~(?<=From: )(.*?)(?: \<)(.*?)(?=\>)~';
$re2 = "~(?<=To: ').*(?=')~";
$re3 = "~(?<=Sent: ).*(?=)~";
$re4 = "~(?<=Subject: ).*(?=)~"; 
$re5 = "~(?<=Subject:\s)(.*?)(?=\s)(?:.*\s\-\s)(.*)~";
$re6 = "~\((.*?)\)~";

//Pull the data out using above expressions
if(preg_match($re1, $text, $matches1)) {
    $from_name = $matches1[1];
    $from_email = $matches1[2];
}
if(preg_match($re2, $text, $matches2))
    $to_email = $matches2[0];

if(preg_match($re3, $text, $matches3))
    $sent_date = $matches3[0];

if(preg_match($re4, $text, $matches4))
    $subject_line = $matches4[0];

if(preg_match($re5, $text, $matches5)) {
    $unit_code = $matches5[1];
    $company_name = $matches5[2];   
}

//Change sent date to timestamp
$sent_date = strtotime($sent_date);

//break the unit code and building code apart
$unit_code = explode('.',$unit_code,2);
$building_code = $unit_code[0];
$unit_code = $unit_code[1];
//break the (C0D3) off the end of the company  / subject line
$company_name = preg_replace($re6,'' ,$company_name);

The data I am trying to separate so that I can store in the DB are:

  1. The email address after 'To:'
  2. The time/date string after 'Date:'
  3. The subject line

My problem is that the script has stopped working properly. My RegEx isn't giving me the timestamp, nor is it breaking the subject line down into its component parts:

FS.G02 Fleet Street - j** associates (AG69)

The code at the beginning is one piece of data I need. I then break it up into the first two letters and the alphanumeric second half.

FS.G02 Fleet Street - j** associates (AG69)

The second part I need is always after the hyphen - it's a company / customer name.

The format of this hasn't changed since I last got it working, so I can't tell whether I have broken the RegEx. Is anyone with a little more RegEx experience than I have able to see where I am going wrong?

Many thanks, Jonathan

1 answer

  • dongya1875 2016-04-14 14:12

    Have you tried using imap_rfc822_parse_headers() (see the PHP docs) instead of a regex? It would certainly make things a lot simpler.

    EDIT: Realised the docs don't actually say a lot about the function. Here's a sample output, called on your data there:

    object(stdClass)#1 (12) {
        ["date"]=> string(31) "Thu, 14 Apr 2016 11:27:32 +0000" 
        ["Date"]=> string(31) "Thu, 14 Apr 2016 11:27:32 +0000" 
        ["subject"]=> string(43) "FS.G02 Fleet Street - j** associates (AG69)"
        ["Subject"]=> string(43) "FS.G02 Fleet Street - j** associates (AG69)"
        ["toaddress"]=> string(69) "'lucien@********.com', UNEXPECTED_DATA_AFTER_ADDRESS@".SYNTAX-ERROR."" 
        ["to"]=> array(2) {
            [0]=> object(stdClass)#2 (2) {
                ["mailbox"]=> string(7) "'lucien" 
                ["host"]=> string(13) "********.com'" 
            }
            [1]=> object(stdClass)#3 (2) { 
                ["mailbox"]=> string(29) "UNEXPECTED_DATA_AFTER_ADDRESS"
                ["host"]=> string(14) ".SYNTAX-ERROR." 
            }
        }
        ["fromaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
        ["from"]=> array(1) {
            [0]=> object(stdClass)#4 (3) {
                ["personal"]=> string(19) "Stephanie Zo*****ou"  
                ["mailbox"]=> string(18) "Stephanie.Zo****ou"
                ["host"]=> string(14) "********.co.uk"
            }
        }
        ["reply_toaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
        ["reply_to"]=> array(1) {
            [0]=> object(stdClass)#5 (3) {
                ["personal"]=> string(19) "Stephanie Zo*****ou"
                ["mailbox"]=> string(18) "Stephanie.Zo****ou"
                ["host"]=> string(14) "********.co.uk"
            }
        }
        ["senderaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
        ["sender"]=> array(1) {
            [0]=> object(stdClass)#6 (3) {
                ["personal"]=> string(19) "Stephanie Zo*****ou"
                ["mailbox"]=> string(18) "Stephanie.Zo****ou"
                ["host"]=> string(14) "********.co.uk" 
            }
        }
     }
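To turn a dump like the one above into the individual values the question stores, something along these lines might work. It is only a sketch: `extract_fields()` is a hypothetical helper name, the sample addresses are placeholders, and it assumes the IMAP extension is available for the call to `imap_rfc822_parse_headers()`. The property names (`from`, `to`, `personal`, `mailbox`, `host`, `subject`, `date`) are the ones visible in the sample output.

```php
<?php
// Hypothetical helper: flatten the object returned by
// imap_rfc822_parse_headers() into the fields the question stores.
function extract_fields(stdClass $headers): array
{
    $from = $headers->from[0] ?? null;
    $to   = $headers->to[0] ?? null;

    return [
        'from_name'  => $from->personal ?? null,
        'from_email' => $from ? $from->mailbox . '@' . $from->host : null,
        // The sample To: address is wrapped in single quotes, so strip them.
        'to_email'   => $to ? trim($to->mailbox . '@' . $to->host, "'") : null,
        'subject'    => $headers->subject ?? null,
        'sent_date'  => isset($headers->date) ? strtotime($headers->date) : null,
    ];
}

// This part only runs when the IMAP extension is loaded.
if (function_exists('imap_rfc822_parse_headers')) {
    // Placeholder headers, not the real message from the question.
    $raw = "Subject: FS.G02 Fleet Street - Something associates (AG69)\r\n"
         . "Date: Thu, 14 Apr 2016 11:27:32 +0000\r\n"
         . "From: Jane Doe <jane.doe@example.com>\r\n"
         . "To: recipient@example.com\r\n";
    print_r(extract_fields(imap_rfc822_parse_headers($raw)));
}
```

Keeping the flattening step separate from the parsing call means it can be exercised with a hand-built object on machines where the extension is not installed.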
    

    Here's a regex for your subject line as well:

    ([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))
    

    When called with preg_match(), like:

    $output = [];
    $input = "FS.G02 Fleet Street - Something associates (AG69)";
    preg_match("/([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))/", $input, $output);
    

    You will receive something like:

    array(
        0   =>  "FS.G02 Fleet Street - Something associates (AG69)",
        1   =>  "FS.G02",
        2   =>  "Fleet Street",
        3   =>  "Something associates",
        4   =>  "(AG69)"
    )
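If you would rather stay with plain regexes, the captured groups can be broken down into the pieces the question asks for. Two details in the question's code look relevant here: `$re3` searches for "Sent: " while the sample header is "Date:", and the lookbehinds expect exactly one space after the colon while the sample shows several. The sketch below tolerates any run of spaces or tabs. `header_value()` is a hypothetical helper, and the header text uses the placeholder company name from the example above, since a name containing characters outside `[A-Za-z\s]` would not match this pattern.

```php
<?php
// Header text modelled on the question; "Something associates" is a placeholder.
$text = "Subject:    FS.G02 Fleet Street - Something associates (AG69)\n"
      . "Date:   Thu, 14 Apr 2016 11:27:32 +0000\n";

// Hypothetical helper: return one header's value, tolerating any run of
// spaces or tabs after the colon. Returns null when the header is absent.
function header_value(string $name, string $text): ?string
{
    $re = '/^' . preg_quote($name, '/') . ':[ \t]*(.*)$/mi';
    return preg_match($re, $text, $m) ? trim($m[1]) : null;
}

$subject_line = header_value('Subject', $text);
$date         = header_value('Date', $text);
$sent_date    = $date === null ? false : strtotime($date);

// Same pattern as in the answer, single-quoted so the backslashes reach PCRE as-is.
$pattern = '/([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))/';

$building_code = $unit_code = $company_name = $customer_code = null;
if ($subject_line !== null && preg_match($pattern, $subject_line, $m)) {
    // "FS.G02" becomes building code "FS" and unit code "G02".
    [$building_code, $unit_code] = explode('.', $m[1], 2);
    $company_name  = $m[3];
    // Drop the surrounding parentheses from "(AG69)".
    $customer_code = trim($m[4], '()');
}
```

Because the company name arrives in its own group, there is no need for a separate `preg_replace()` pass to strip the trailing code.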
    