dongqinta4174 2018-09-06 22:05
Views: 224
Accepted

How to preserve line breaks in HTML table cells when scraping with gocolly

I'm trying to preserve the formatting
in table cells when I extract the contents of a <td> cell.

What happens is that if there are two lines of text (e.g., an address) in the <td>, the markup may look like: <td> address line1<br>address line2</td>

When colly extracts this, I get the following: address line1address line2

with no spacing or line breaks, since all the HTML has been stripped from the text.

How can I work around or fix this so that I receive readable text from the <td>?


2 answers

  • dpi9530 2018-09-09 05:48

    As far as I know, gocolly does not support this kind of formatting directly, but you can do something like the following, using the OutputHTML function of the htmlquery package (which gocolly uses internally):

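    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/antchfx/htmlquery"
    )

    const htmlPage = `
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
      <head>
        <title>Your page title here</title>
      </head>
      <body>
        <p>
        AddressLine 1 
        <br>
        AddressLine 2
        </p>
      </body>
    </html>
    `

    func main() {
        // Parse the document into a node tree.
        doc, err := htmlquery.Parse(strings.NewReader(htmlPage))
        if err != nil {
            log.Fatal(err)
        }
        // Select the element whose inner markup should be kept.
        node := htmlquery.FindOne(doc, "//p")
        // false = render only the children, not the <p> tag itself.
        result := htmlquery.OutputHTML(node, false)
        fmt.Println(result)
    }
    

    The value of result is now:

     AddressLine 1
       <br/>
     AddressLine 2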
    

    You can now split result on the <br/> tag to get what you want.
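The splitting step could be sketched roughly as below. This is only a minimal illustration using the standard library: brToNewlines and brTag are hypothetical names, not part of gocolly or htmlquery, and the sketch assumes the fragment contains nothing but text and <br> tags (any other markup would be left in place).

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// brTag matches <br>, <br/> and <br /> in any letter case.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// brToNewlines is a hypothetical helper: it turns each <br> variant in an
// HTML fragment into a line break, trims every line, and drops empty lines.
func brToNewlines(fragment string) string {
	parts := brTag.Split(fragment, -1)
	lines := make([]string, 0, len(parts))
	for _, part := range parts {
		// Decode entities such as &amp; and strip surrounding whitespace.
		line := strings.TrimSpace(html.UnescapeString(part))
		if line != "" {
			lines = append(lines, line)
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(brToNewlines(" address line1<br>address line2"))
}
```

Passing the result string from the htmlquery snippet to such a helper should give one address line per output line. A regular expression is used rather than a plain strings.Split so that the <br>, <br/> and <br /> spellings are all handled.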

    I am also new to Go, though, so there may be a better way to do it.

    This answer was selected as the best answer by the asker.