doucao8982 2013-08-22 11:23

Reliably cleaning up email message body encoding

I am writing a small piece of software in PHP which connects to an IMAP mailbox and stores the messages it contains in a MySQL database for later processing.

I have noticed that during testing I get some strange characters appearing in the message body when I attempt to save the message body raw. I am using imap_fetchbody() to extract the message body.

Running quoted_printable_decode() on the message body helps. However, after a lot of research I have also learned that this will not always work, and that other functions such as utf8_encode() and base64_decode() are sometimes needed instead.

So, my question is: what is the best method for reliably cleaning an email message body with PHP to cover all encoding scenarios?

1 answer

  • douning5041 2013-08-23 09:35

    An "email body" is nowadays actually a tree of individual MIME parts. Sometimes there's just one of them, e.g. a text/plain mail. Sometimes there's a multipart/alternative which wraps two "equivalent" copies of the message, one as text/plain and the other as text/html. Sometimes the structure is much more complicated, with many levels of nesting. It is quite common for some of these parts to be binary content, like images, attached ZIP files and so on.

    Each of these individual MIME parts can be encoded for transport; the encoding used is specified in the Content-Transfer-Encoding header of the corresponding MIME part. The two encoding schemes which you absolutely must support to interoperate are quoted-printable and base64. An important observation is that this encoding is applied separately to each part, i.e. it's perfectly legal to have a multipart/alternative with a text/plain part encoded with quoted-printable and a text/html part encoded in base64.
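
    As a minimal sketch of that per-part step, the helper below undoes the transfer encoding of a single part. The function name decode_transfer_encoding() and the idea of passing the header value as a string are assumptions made for illustration. If you read the structure with imap_fetchstructure(), the encoding arrives as an integer (ENCBASE64, ENCQUOTEDPRINTABLE and so on) and you would switch on that instead.

```php
<?php
// Hypothetical helper: undo the Content-Transfer-Encoding of ONE MIME part.
// $encoding is the header value as a string, e.g. "base64".
function decode_transfer_encoding(string $raw, string $encoding): string
{
    switch (strtolower(trim($encoding))) {
        case 'quoted-printable':
            return quoted_printable_decode($raw);
        case 'base64':
            // Non-strict mode skips the line breaks that wrap base64 bodies.
            $decoded = base64_decode($raw);
            return $decoded === false ? $raw : $decoded;
        default:
            // 7bit, 8bit, binary: the bytes are already in their final form.
            return $raw;
    }
}

// Two parts of the same message, each with its own encoding.
echo decode_transfer_encoding("caf=C3=A9", 'quoted-printable'), "\n";
echo decode_transfer_encoding("SGVsbG8s\r\nIHdvcmxk", 'BASE64'), "\n";
```

    The result is still a byte string in whatever character set the part declares, so non-ASCII text may look wrong until the charset is handled as well.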

    Once you have undone the transfer encoding, you still have to decode the text from its character encoding, i.e. turn the stream of bytes into Unicode text. To do that, consult the charset parameter of the Content-Type MIME header (again, the part header, not the whole-message header, unless the message itself has only one part).
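
    A rough sketch of the charset step, assuming you want UTF-8 out and that the iconv extension is available. The helper name to_utf8() and its fallback behaviour are illustrative choices, not something prescribed by the RFCs.

```php
<?php
// Hypothetical helper: convert transfer-decoded bytes to UTF-8
// using the charset parameter of the part's Content-Type header.
function to_utf8(string $bytes, ?string $charset): string
{
    // MIME's default charset for text parts is US-ASCII when none is given.
    $charset = strtoupper(trim((string) $charset));
    if ($charset === '' || $charset === 'US-ASCII' || $charset === 'UTF-8') {
        // ASCII is a subset of UTF-8, so there is nothing to convert.
        return $bytes;
    }
    // Assumption: on an unknown charset or bad input, keep the original bytes.
    $converted = @iconv($charset, 'UTF-8', $bytes);
    return $converted === false ? $bytes : $converted;
}

// "café" as sent in a part declared as charset=iso-8859-1
echo to_utf8("caf\xE9", 'iso-8859-1'), "\n";
```

    This is also why utf8_encode() only helps some of the time: it converts from ISO-8859-1 only, so it is wrong for any part that declares another charset.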

    All details you need to know are in RFC 2045, RFC 2046, RFC 2047 and RFC 2048 (and their corresponding updates).

    Finally, there's also the interesting question of what the "main part" of an e-mail is. Suppose you have something like this:

    1 multipart/mixed
      + 1.1 text/plain: "Hi, I'm forwarding Jeff's message..."
      + 1.2 message/rfc822
        + 1.2.1 multipart/alternative
           + 1.2.1.1 text/plain "Hi colleagues, I'm sending the meeting notes from..."
           + 1.2.1.2 text/html "<p>Hi colleagues,..."
    

    This is the structure you get when Fred forwards Jeff's message to you. What is the "main part" here?
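
    There is no single right answer, but one possible policy can be sketched as follows: walk the tree depth-first, take the first part of the preferred type, and never descend into a forwarded message/rfc822. The nested-array shape and the find_main_text() name are hypothetical and only mirror the example tree above; they are not what imap_fetchstructure() returns.

```php
<?php
// Hypothetical tree shape: ['id' => '1.1', 'type' => 'text/plain', 'parts' => [...]]
// Returns the id of the first part of the preferred type, or null if none is found.
function find_main_text(array $part, string $preferred = 'text/plain'): ?string
{
    $type = strtolower($part['type']);
    if ($type === 'message/rfc822') {
        // Assumption: an embedded (forwarded) message is never the "main part".
        return null;
    }
    if ($type === $preferred) {
        return $part['id'];
    }
    foreach ($part['parts'] ?? [] as $child) {
        $found = find_main_text($child, $preferred);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// The forwarded-message example from above.
$message = ['id' => '1', 'type' => 'multipart/mixed', 'parts' => [
    ['id' => '1.1', 'type' => 'text/plain'],
    ['id' => '1.2', 'type' => 'message/rfc822', 'parts' => [
        ['id' => '1.2.1', 'type' => 'multipart/alternative', 'parts' => [
            ['id' => '1.2.1.1', 'type' => 'text/plain'],
            ['id' => '1.2.1.2', 'type' => 'text/html'],
        ]],
    ]],
]];

echo find_main_text($message), "\n"; // prints 1.1
```

    Under this policy Fred's covering note (1.1) wins. An application that cares more about Jeff's original text would need a different rule, which is exactly why the question has to be decided per use case.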
