dongqinta4174 2018-09-06 22:05


How to preserve line breaks in HTML table cells when scraping with gocolly

I'm trying to preserve the formatting in table cells when I extract the contents of a <td> cell.

If a <td> contains two lines of text (e.g. an address), the markup may look like this: <td> address line1<br>address line2</td>

When colly extracts this, I get the following:

address line1address line2

with no spacing or line breaks, since all the HTML has been stripped from the text.

How can I work around or fix this so that I get readable text from the <td>?

dpi9530 2018-09-09 05:48
As far as I know, gocolly does not support this kind of formatting, but you can do something like the following, using the OutputHTML function of the htmlquery package (which gocolly uses internally):

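    package main

    import (
        "fmt"
        "strings"

        "github.com/antchfx/htmlquery"
    )

    const htmlPage = `
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <head> <title>Your page title here</title> </head>
    <body> <p> AddressLine 1 <br> AddresLine 2 </p> </body>
    </html>
    `

    func main() {
        doc, err := htmlquery.Parse(strings.NewReader(htmlPage))
        if err != nil {
            panic(err)
        }
        // Select the first <p> element with an XPath expression.
        node := htmlquery.FindOne(doc, "//p")
        // self=false returns only the inner HTML, without the <p> tags.
        result := htmlquery.OutputHTML(node, false)
        fmt.Println(result)
    }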

The result variable now contains:

AddressLine 1 <br/> AddresLine 2

You can now split result on the <br/> tag to get the individual lines.

I am also new to Go, though, so there may be a better way to do it.
This answer was selected by the asker as the best answer.
