dsxmwin86342 2017-06-08 17:00
Views: 431
Answer accepted

How to extract only the text from HTML in Golang?

To extract text from HTML, I use a fully HTML5-compliant tokenizer and parser (the golang.org/x/net/html package), like this:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        s := `
    <p>Links:</p><ul><li><a href="foo">Foo</a><li>
    <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
    <script type='text/javascript'>
    /* <![CDATA[ */
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
    /* ]]> */
    </script>`

        domDocTest := html.NewTokenizer(strings.NewReader(s))
        // Walk every token and print the non-empty text tokens.
        for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; tokenType = domDocTest.Next() {
            if tokenType != html.TextToken {
                continue
            }
            // Text() already returns unescaped text for ordinary text tokens.
            txtContent := strings.TrimSpace(string(domDocTest.Text()))
            if len(txtContent) > 0 {
                fmt.Printf("%s\n", txtContent)
            }
        }
    }

but I get this output:

    Links:
    Foo
    BarBaz
    TEXT
    I
    WANT
    /* <![CDATA[ */
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
    /* ]]> */

I don't want the CDATA content, that is, the body of the <script> element. Any ideas on how to get only the text content?

2 answers

  • dphnn333971 2017-06-09 09:21

    As @Eric Pauley pointed out, I look at both TextTokens and StartTagTokens, and skip any text whose most recent start tag is <script>. Here is my solution:

        package main

        import (
            "fmt"
            "strings"

            "golang.org/x/net/html"
        )

        func main() {
            s := `
        <p>Links:</p><ul><li><a href="foo">Foo</a><li>
        <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
        <script type='text/javascript'>
        /* <![CDATA[ */
        var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
        /* ]]> */
        </script>`

            domDocTest := html.NewTokenizer(strings.NewReader(s))
            // Most recent start tag. It is cleared on every end tag so that
            // text following </script> is not skipped by mistake.
            var previousStartTokenTest html.Token
        loopDomTest:
            for {
                tt := domDocTest.Next()
                switch tt {
                case html.ErrorToken:
                    break loopDomTest // end of the document (or a read error), done
                case html.StartTagToken:
                    previousStartTokenTest = domDocTest.Token()
                case html.EndTagToken:
                    previousStartTokenTest = html.Token{}
                case html.TextToken:
                    if previousStartTokenTest.Data == "script" {
                        continue // raw script body, not visible text
                    }
                    // Text() already returns unescaped text for ordinary text tokens.
                    txtContent := strings.TrimSpace(string(domDocTest.Text()))
                    if len(txtContent) > 0 {
                        fmt.Printf("%s\n", txtContent)
                    }
                }
            }
        }
    
    This answer was selected by the asker as the best answer.