dongqinta4174 2018-09-07 06:05
Views: 224
Accepted

How to preserve line breaks in HTML table cells when scraping with gocolly

I'm trying to preserve the formatting in table cells when I extract the contents of a <td> cell.

What happens is that if there are two lines of text (e.g., an address) in the <td>, the code may look like: <td> address line1<br> address line2</td>

When colly extracts this, I get the following: address line1address line2

with no spacing or line breaks since all the html has been stripped from the text.

How can I work around or fix this so that I receive readable text from the <td>?


2 answers

  • dpi9530 2018-09-09 13:48

    As far as I know, gocolly does not support this kind of formatting, but you can do something like the following using the OutputHTML function of the htmlquery package (which gocolly uses internally):

    const htmlPage = `
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
      <head>
        <title>Your page title here</title>
      </head>
      <body>
        <p>
        AddressLine 1 
        <br>
        AddresLine 2
        </p>
      </body>
    </html>
    `
    
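    // Needs "strings" and "github.com/antchfx/htmlquery" imported.
    doc, err := htmlquery.Parse(strings.NewReader(htmlPage))
    if err != nil {
        panic(err)
    }
    node := htmlquery.FindOne(doc, "//p")
    result := htmlquery.OutputHTML(node, false) // false: inner HTML only, without the <p> tag itself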
    

    The result variable now contains something like this:

     AddressLine 1
       <br/>
     AddresLine 2
    

    You can now split result on the <br/> tag and achieve what you want.

    I am also new to Go, so there may be a better way to do it.
