duanchoupo1104 2013-06-27 08:58
93 views

Cleaning emails collected from IMAP for insertion into a database

I'm writing a script for our internal customer support system. It should collect emails from our IMAP inbox (hosted on Gmail) and parse them into the database.
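As a rough illustration of the "collect and parse" step, here is a minimal Python sketch that pulls the best body part out of one raw message. It assumes the raw bytes have already been fetched (for example with `imaplib`'s `fetch(num, "(RFC822)")`); the function name `extract_body` is made up for this example, and the original script's language is not stated in the question.

```python
import email
from email import policy


def extract_body(raw_bytes):
    """Return (subtype, text) for the preferred body part of a raw message.

    `raw_bytes` is assumed to be the full RFC 822 message as bytes.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the HTML alternative; fall back to plain text if there is none.
    part = msg.get_body(preferencelist=("html", "plain"))
    if part is None:
        return ("", "")
    # get_content() decodes the transfer encoding and charset for text parts.
    return (part.get_content_subtype(), part.get_content())
```

The returned HTML still contains all of the sender's markup, so it needs to be cleaned before it goes into the database.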

What is the best way to clean out frames, badly coded tags, and messy formatting so that the result is clean text with minimal formatting?

I'm aware that regular expressions will most likely play a large part, but I'd like to know whether this functionality already exists in a library that I'm missing.

Edit: More specifically, what needs to be removed:

All inline CSS/styling, and all HTML except simple formatting such as bold, underline, and italics.

Here's an email I'm using as a test case. It's a fairly beefy spam email I got from ZoneAlarm, and it has a bit of everything.

<td>
                    <br>
                    <br>


                    <table align="center" bgcolor="#749FD0" border="0" cellpadding="0" cellspacing="0" style="font-family:Arial,Helvetica,sans-serif;font-size:12px;line-height:16px;color:#555555" valign="top" width="700">
                        <tbody>
                            <tr>
                                <td>

                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10">
                                                    <img border="0" height="1" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" style="display: block; max-width: 2880px;" width="1"></td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/nw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">
                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2000&amp;&amp;&amp;http://www.zonealarm.com?cid=E200246" target="_blank"><img alt="ZoneAlarm by Check Point Software Technologies LTD." border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/za_transparent.png" width="120" style="display: block; max-width: 2880px;" title="ZoneAlarm by Check Point Software Technologies LTD."></a></td>
                                                <td align="right" style="font-family:Arial,Helvetica,sans-serif" width="150">
                                                    <span style="color:#999999;font-size:12px">Connect with ZoneAlarm</span></td>
                                                <td align="right" valign="middle" width="125">
                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2001&amp;&amp;&amp;http://www.facebook.com/ZoneAlarmFirewall" target="_blank"><img alt="ZoneAlarm Facebook" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/facebook.png" width="22" title="ZoneAlarm Facebook" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2002&amp;&amp;&amp;http://twitter.com/zonealarm" target="_blank"><img alt="ZoneAlarm Twitter" border="0" width="22" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/twitter.png" title="ZoneAlarm Twitter" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2003&amp;&amp;&amp;http://www.youtube.com/zonealarmsecurity" target="_blank"><img alt="ZoneAlarm YouTube" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/youtube.png" title="ZoneAlarm YouTube" height="22" style="max-width: 2880px;"></a><img border="0" height="15" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" width="10" style="max-width: 2880px;"></td>
                                                    <td bgcolor="#E3ECEC" rowspan="6" align="center" valign="top" width="1">
                                                <img align="right" height="32" src="http://download.zonealarm.com/bin/images/emails/welcome/borderx1.png" width="1" style="max-width: 2880px;">
                                                    </td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/sw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">

1 answer

  • dongwei4103 2013-06-27 09:11

    You can use HTML Purifier for this, see: http://htmlpurifier.org/
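HTML Purifier is a PHP library that filters markup against a whitelist. To show the general idea only, here is a minimal Python sketch of an allowlist filter built on the standard `html.parser` module. It is not HTML Purifier's API, and the names `ALLOWED`, `DROP_CONTENT`, `SimpleFormatter`, and `clean_html` are made up for this example. It keeps bold, italic, and underline tags, drops every attribute (so inline styles disappear), and discards everything else. Unlike a real purifier, it does not repair unbalanced tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"b", "strong", "i", "em", "u"}             # simple formatting to keep
DROP_CONTENT = {"script", "style", "head", "title"}   # drop the tag and its text


class SimpleFormatter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in ALLOWED:
            self.out.append("<%s>" % tag)  # attributes are never copied
        else:
            self.out.append(" ")  # keep text of adjacent cells apart

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in ALLOWED:
            self.out.append("</%s>" % tag)
        else:
            self.out.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape &, <, > so the output is still safe to render as HTML.
            self.out.append(escape(data, quote=False))


def clean_html(html_text):
    parser = SimpleFormatter()
    parser.feed(html_text)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.out).split())
```

For mail from untrusted senders, a maintained sanitizer such as the one recommended above is the safer choice; this sketch is only meant to show why an allowlist is easier to reason about than a pile of regular expressions.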
