duanqinqian5299 2016-03-12 18:14

How to convert an HTML table into an array using Golang

I'm having trouble converting an HTML table into a Go array. I've tried both golang.org/x/net/html and goquery, without success with either.

Let's say we have this HTML table:

<html>
  <body>
    <table>
      <tr>
        <td>Row 1, Content 1</td>
        <td>Row 1, Content 2</td>
        <td>Row 1, Content 3</td>
        <td>Row 1, Content 4</td>
      </tr>
      <tr>
        <td>Row 2, Content 1</td>
        <td>Row 2, Content 2</td>
        <td>Row 2, Content 3</td>
        <td>Row 2, Content 4</td>
      </tr>
    </table>
  </body>
</html>

And I'd like to end up with this array:

------------------------------------
|Row 1, Content 1| Row 1, Content 2|
------------------------------------
|Row 2, Content 1| Row 2, Content 2|
------------------------------------

As you can see, I'm ignoring Content 3 and Content 4 in each row.

My extraction code:

func extractValue(content []byte) {
  doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))

  doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
    // ...
  })
}

I've tried adding a counter to skip the <td> elements that I don't want to convert, and calling

td.NextAll()

but with no luck. Does anyone have an idea of how I can accomplish this?

Thanks.

2 answers

  • doujiazong0322 2016-03-12 22:38

    You can do this with the golang.org/x/net/html package alone.

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    var body = strings.NewReader(`
            <html>
            <body>
            <table>
            <tr>
            <td>Row 1, Content 1</td>
            <td>Row 1, Content 2</td>
            <td>Row 1, Content 3</td>
            <td>Row 1, Content 4</td>
            </tr>
            <tr>
            <td>Row 2, Content 1</td>
            <td>Row 2, Content 2</td>
            <td>Row 2, Content 3</td>
            <td>Row 2, Content 4</td>
            </tr>
            </table>
            </body>
            </html>`)

    func main() {
        z := html.NewTokenizer(body)
        content := []string{}

        // Stop when the tokenizer reports an error; reaching the end of
        // the input (io.EOF) is reported as an ErrorToken too.
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                break
            }
            if tt != html.StartTagToken {
                continue
            }
            if t := z.Token(); t.Data != "td" {
                continue
            }
            // The token right after <td> is the cell's text, if it has any.
            if z.Next() == html.TextToken {
                // Skip whitespace-only text, e.g. after an unclosed <td>.
                if text := strings.TrimSpace(string(z.Text())); text != "" {
                    content = append(content, text)
                }
            }
        }
        // Print to check the slice's content
        fmt.Println(content)
    }
    

    This code is written for this specific HTML pattern only, but refactoring it to be more general shouldn't be hard.

    This answer was selected as the best answer by the asker.