weixin_39546092
2021-01-08 18:39

Replace Treetop parser with a Ragel based parser

This pull request replaces the Treetop-based parser with a Ragel-based parser. The change is primarily intended to improve the performance of message processing. Compared to the Treetop parser, the Ragel version is ~7.5x faster.

The new parser exhibits the same behavior as the current parser except for a couple of cases where the Treetop parser was handling fields incorrectly. I've submitted PRs to fix both issues: PR #487 and PR #481. A couple of specs related to these issues are marked pending in the current PR. Assuming these other two PRs are merged, I will rebase and remove the pending lines from those specs.

The change in parsers necessitates removing a public interface. That interface is already deprecated, but removing it may still require a major version bump.

I know this is a massive change. Let me know if there is anything I can do to make it more digestible.

Benchmark

Parsing a set of 1000 emails from the Enron data set:


Mail-2.5.3: 24.78s (40.35 emails/second)
Mail-2.5.3 w/Ragel Parser: 3.245290s (308.6 emails/second)
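
The ratio of the two timings above is 24.78 / 3.245290, or about 7.6. A minimal sketch of how a throughput figure like this could be measured with Ruby's standard Benchmark module is shown below. The helper names `measure_throughput` and `speedup` and the stand-in workload are assumptions for illustration, not the harness that produced the numbers above; in a real run the block would call something like `Mail.new(raw)`.

```ruby
require 'benchmark'

# Hypothetical helper: times a parser block over a list of raw messages
# and reports elapsed seconds and messages per second.
def measure_throughput(raw_messages)
  elapsed = Benchmark.realtime do
    raw_messages.each { |raw| yield raw }
  end
  rate = elapsed > 0 ? raw_messages.size / elapsed : Float::INFINITY
  { elapsed: elapsed, per_second: rate }
end

# Speedup implied by two timings: baseline divided by candidate.
def speedup(baseline_seconds, candidate_seconds)
  baseline_seconds / candidate_seconds
end

# Stand-in workload (an assumption). Replace the block body with a real
# parser call, e.g. Mail.new(raw), when the mail gem is loaded.
sample = Array.new(1000) { |i| "Subject: message #{i}\r\n\r\nbody" }
result = measure_throughput(sample) { |raw| raw.split("\r\n\r\n", 2) }

puts format('%.4fs (%.1f messages/second)', result[:elapsed], result[:per_second])
puts format('speedup: %.2fx', speedup(24.78, 3.245290))
```

Running the same block once per parser on the same message set keeps the comparison like for like.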

Parser layout


lib/mail/parsers/
  address_lists_parser.rb        # Build data Structs for the Elements (lib/mail/elements)
  content_disposition_parser.rb  # by interpreting actions emitted by the state machine modules.
  ...
  ragel/
    common.rl # Main grammar definition
    ruby/
      machines/
        address_lists_machine.rb        # Ragel state machines which emit events that
        content_disposition_machine.rb  # are consumed by the higher level parsers
        ...

Further Work

I am also working on a native parser (based on the same Ragel grammar) that will improve performance even further. It uses an FFI interface to a custom shared module so that the gains can be shared with Rubinius and JRuby. This is one of the advantages of using a Ragel-based parser, and it is why there is a strict separation between the state machine modules and the classes that interpret the actions.
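
As a rough illustration of that separation (not the code in this PR), the sketch below splits the work in two: a toy "machine" that only emits `[action, start, stop]` events, and a parser class that interprets those events to build a Struct. The names `ToyDispositionMachine`, `ToyDispositionParser`, `ToyDisposition`, and the event shape are all hypothetical. The real machines are Ragel state machines, whereas this stand-in uses StringScanner and handles only simple unquoted tokens.

```ruby
require 'strscan'

# Toy "machine": scans the input and emits [action, start, stop] events.
# It knows nothing about the data structure being built.
module ToyDispositionMachine
  TOKEN = /[^\s;=]+/

  def self.scan(input)
    events = []
    s = StringScanner.new(input)
    s.skip(/\s*/)
    emit(events, :disposition_type, s) if s.scan(TOKEN)
    until s.eos?
      break unless s.skip(/\s*;\s*/)
      break unless s.scan(TOKEN)
      emit(events, :param_attr, s)
      break unless s.skip(/\s*=\s*/)
      break unless s.scan(TOKEN)
      emit(events, :param_value, s)
    end
    events
  end

  # Records the byte range of the scanner's most recent match.
  def self.emit(events, action, scanner)
    stop = scanner.pos
    events << [action, stop - scanner.matched_size, stop]
  end
end

ToyDisposition = Struct.new(:disposition_type, :parameters)

# Higher-level parser: interprets the emitted actions and builds the Struct.
class ToyDispositionParser
  def parse(input)
    result = ToyDisposition.new('', {})
    pending_attr = nil
    ToyDispositionMachine.scan(input).each do |action, start, stop|
      text = input.byteslice(start, stop - start)
      case action
      when :disposition_type then result.disposition_type = text.downcase
      when :param_attr       then pending_attr = text
      when :param_value      then result.parameters[pending_attr] = text
      end
    end
    result
  end
end

parsed = ToyDispositionParser.new.parse('Attachment; filename=report.pdf')
puts parsed.disposition_type
puts parsed.parameters.inspect
```

Because the parser class depends only on the event stream, a different machine that emits the same events (for example a native one) could be swapped in without touching the interpreting code.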

This question comes from the open source project: mikel/mail


17 answers

  • weixin_39748928 4 months ago

    This is awesome. What do you say?

  • weixin_39975683 4 months ago

    Wow the benchmark is promising! And keeping the same Ragel grammar for a native parser is a good move.

  • weixin_39881575 4 months ago

    Great patch. Tested in two apps and runs fine!

    Pushing the parsing machinery out to arms' length cleans up the field classes nicely.

  • weixin_39881575 4 months ago

    Nice speedup in the test suite as well!

    master:

    
    Finished in 19.91 seconds
    1455 examples, 0 failures, 9 pending
    

    ragel:

    
    Finished in 6.55 seconds
    1407 examples, 0 failures, 10 pending
    
  • weixin_39881575 4 months ago

    Could you rebase on the latest master? Here's a rundown of the .treetop changes:

    diff --git a/lib/mail/parsers/content_transfer_encoding.treetop b/lib/mail/parsers/content_transfer_encoding.treetop
    index 9d0f50a..9db6134 100644
    --- a/lib/mail/parsers/content_transfer_encoding.treetop
    +++ b/lib/mail/parsers/content_transfer_encoding.treetop
    @@ -9,12 +9,10 @@ module Mail
         end
    
         rule encoding
    -      ietf_token "s"? {
    -        def text_value
    -          ietf_token.text_value
    -        end
    -      } / custom_x_token
    +      "7bits" / "8bits" /
    +      "7bit" / "8bit" / "binary" / "quoted-printable" / "base64" /
    +      ietf_token / custom_x_token
         end
    
       end
    -end
    \ No newline at end of file
    +end
    diff --git a/lib/mail/parsers/content_type.treetop b/lib/mail/parsers/content_type.treetop
    index 86fe64b..84eeced 100644
    --- a/lib/mail/parsers/content_type.treetop
    +++ b/lib/mail/parsers/content_type.treetop
    @@ -5,7 +5,7 @@ module Mail
         include RFC2045
    
         rule content_type
    -      main_type "/" sub_type param_hashes:(CFWS ";"? parameter CFWS)* {
    +      main_type "/" sub_type param_hashes:(CFWS ";"* parameter CFWS)* {
             def parameters
               param_hashes.elements.map do |param|
                 param.parameter.param_hash
    @@ -65,4 +65,4 @@ module Mail
         end
    
       end
    -end
    \ No newline at end of file
    +end
    diff --git a/lib/mail/parsers/rfc2045.treetop b/lib/mail/parsers/rfc2045.treetop
    index c166492..2839e73 100644
    --- a/lib/mail/parsers/rfc2045.treetop
    +++ b/lib/mail/parsers/rfc2045.treetop
    @@ -8,8 +8,7 @@ module Mail
         end
    
         rule ietf_token
    -      "7bit" / "8bit" / "binary" /
    -      "quoted-printable" / "base64"
    +      token+
         end
    
         rule custom_x_token
    diff --git a/lib/mail/parsers/rfc2822.treetop b/lib/mail/parsers/rfc2822.treetop
    index fc437f6..77dc3d6 100644
    --- a/lib/mail/parsers/rfc2822.treetop
    +++ b/lib/mail/parsers/rfc2822.treetop
    @@ -184,7 +184,7 @@ module Mail
         end
    
         rule quoted_string
    -      CFWS? DQUOTE quoted_content:(FWS? qcontent)+ FWS? DQUOTE CFWS?
    +      CFWS? DQUOTE quoted_content:(FWS? qcontent)* FWS? DQUOTE CFWS?
         end
    
         rule qcontent
    @@ -222,7 +222,22 @@ module Mail
         end
    
         rule mailbox
    -      name_addr / addr_spec
    +      (name_addr / addr_spec) {
    +        def dig_comments(comments, elements)
    +          elements.each { |elem|
    +            if elem.respond_to?(:comment)
    +              comments << elem.comment
    +            end
    +            dig_comments(comments, elem.elements) if elem.elements
    +           }
    +        end
    +
    +        def comments
    +          comments = []
    +          dig_comments(comments, elements)
    +          comments
    +        end
    +      }
         end
    
         rule address
    @@ -244,24 +259,7 @@ module Mail
             end
    
           } /
    -      mailbox {
    -
    -      def dig_comments(comments, elements)
    -        elements.each { |elem|
    -          if elem.respond_to?(:comment)
    -            comments << elem.comment
    -          end
    -          dig_comments(comments, elem.elements) if elem.elements
    -         }
    -      end
    -
    -      def comments
    -        comments = []
    -        dig_comments(comments, elements)
    -        comments
    -      end
    -
    -      }
    +      mailbox
         end
    
         rule address_list
    @@ -340,7 +338,7 @@ module Mail
         end
    
         rule name_val_list
    -      (CFWS)? (name_val_pair (CFWS name_val_pair)*)
    +      (CFWS)? (name_val_pair (CFWS name_val_pair)*)?
         end
    
         rule name_val_pair
    
  • weixin_39546092 4 months ago

    Rebased!

  • weixin_40009472 4 months ago

    This PR is awesome. I have an email here with 612 recipients and 1152 Ccs (someone really fails at email :). It cuts the time for Mail.new(str).to_s from 20.5s to 1.0s.

  • weixin_40009472 4 months ago

    I'm definitely "doing it wrong", but this seems like unexpected behaviour:

    h = Mail::Header.new
    h['From'] = "Conrad Irwin <me.in> "
    h['From'].addresses
    # => ["me.in", "me"]

    Without the trailing space, or when using the old Treetop parser, I get the expected ["me.in"].

  • weixin_39546092 4 months ago

    Interesting, I'll look at that. My goal is for the two parsers to be as compatible as possible, within reason.
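
One way to check that kind of compatibility is a differential test: feed the same inputs to both parsers and list the inputs where the results differ. The sketch below is illustrative only. `parser_mismatches` is a hypothetical helper, and both lambdas are stand-ins (the second one is deliberately flawed to mimic the trailing-space report above). With the mail gem loaded, each lambda could instead build a `Mail::Header` with the chosen parser and return `h['From'].addresses`.

```ruby
# Hypothetical differential check: run the same inputs through two parser
# callables and collect the inputs on which their results differ.
def parser_mismatches(inputs, reference, candidate)
  inputs.each_with_object([]) do |input, mismatches|
    expected = reference.call(input)
    actual = candidate.call(input)
    next if expected == actual

    mismatches << { input: input, expected: expected, actual: actual }
  end
end

# Stand-in for the reference parser: extracts whatever sits in angle brackets.
reference = ->(value) { value.scan(/<([^>]+)>/).flatten }

# Deliberately flawed stand-in: adds a bogus extra entry on trailing whitespace.
candidate = lambda do |value|
  found = value.scan(/<([^>]+)>/).flatten
  value.end_with?(' ') ? found + found.map { |a| a.split('.').first } : found
end

inputs = ['Conrad Irwin <me.in>', 'Conrad Irwin <me.in> ']
parser_mismatches(inputs, reference, candidate).each do |m|
  puts "#{m[:input].inspect}: expected #{m[:expected].inspect}, got #{m[:actual].inspect}"
end
```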

  • weixin_39546092 4 months ago

    Rebased against latest master and fixed the issue found.

  • weixin_40009472 4 months ago

    Awesome, thanks! We've been running your code since Friday, and all seems to be going well so far :). (I guess we're parsing about a thousand emails a day through it at the moment, so not a huge amount, but definitely confidence inspiring)

  • weixin_39610759 4 months ago

    Can you check this?

    I would love to ship the next Rails 4 beta with mail 2.6.0.

  • weixin_39881575 4 months ago

    Working well for me. Next major release will be a while though. Next minor, maybe.

  • weixin_39953845 4 months ago

    Sweet. What more work needs to be done to get this merged?

  • weixin_39881575 4 months ago

    Slated for merge to master after next minor version release. Please do test it out in your own apps!

  • weixin_39546092 4 months ago
    • Updated the funky grammar that was pointed out.
    • Added a fix for email addresses that begin with a comment, which was triggering an exception.
    • Rebased on current master.
  • weixin_39933336 4 months ago

    This is great work. I'll be doing some updates to get a minor release out, then we'll merge this into the next major release.
