drmqzb5063 2017-05-17 06:36
Views: 91
Accepted

Golang: XML Unmarshal of an HTML table

I have a simple HTML table and want to get all cell values, even when a cell contains HTML markup.

I tried xml.Unmarshal, but I can't find struct tags that give me the right values or attributes.

package main

import (
    "encoding/xml"
    "fmt"
)

type XMLTable struct {
    XMLName xml.Name `xml:"TABLE"`
    Row     []struct {
        Cell string `xml:"TD"`
    } `xml:"TR"`
}

func main() {
    raw_html_table := `
    <TABLE><TR>
    <TD>lalalal</TD>
    <TD>papapap</TD>
    <TD>fafafa</TD>
    <TD>
    <form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>
    <input type=hidden name=acT value=\"Dev\">
    <input type=hidden name=acA value=\"Anyval\">
    <input type=submit name=submit value=Stop>
    </form>
    </TD>
    </TR>
    </TABLE>`

    table := XMLTable{}
    fmt.Printf("%q\n", []byte(raw_html_table)[:15])
    err := xml.Unmarshal([]byte(raw_html_table), &table)
    if err != nil {
        fmt.Printf("error: %v", err)
    }
}

As additional info: I don't care about a cell's content if it is HTML code; I only need the plain []byte / string values. So I could strip such content before unmarshaling, but that isn't easy either.

Any suggestions using the standard Go libraries would be welcome.

2 answers

  • dp518158 2017-05-17 07:10

    Sticking to the standard lib

    Your input is not valid XML, so even if you model it right, you won't be able to parse it.

    First, you're using a raw string literal to define your input HTML as a string, and raw string literals cannot contain escapes. For example this:

    <form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>
    

    Inside a raw string literal, \" is not an escape: it stands for exactly those two characters, a backslash followed by a quotation mark. You don't need an escape there anyway; just write a plain quotation mark: ".
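
    To make the difference concrete, here is a minimal sketch comparing a raw string literal with an interpreted one. The helper names rawAttr and interpretedAttr are made up for illustration.

```go
package main

import "fmt"

// rawAttr uses a raw string literal (backquotes): the backslashes are
// kept verbatim, so the result contains \" as two separate characters.
func rawAttr() string {
	return `value=\"Dev\"`
}

// interpretedAttr uses an interpreted string literal (double quotes):
// here \" is an escape sequence for a single quotation mark.
func interpretedAttr() string {
	return "value=\"Dev\""
}

func main() {
	fmt.Println(rawAttr())         // value=\"Dev\"
	fmt.Println(interpretedAttr()) // value="Dev"
	// The raw version is two bytes longer because of the two backslashes.
	fmt.Println(len(rawAttr()), len(interpretedAttr())) // 13 11
}
```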

    Next, in XML you cannot have attributes without putting their values in quotes.

    Third, each element must have a matching closing tag (or be self-closing); your <input> elements are never closed.

    So for example this line:

    <input type=hidden name=acT value=\"Dev\">
    

    Must be changed to:

    <input type="hidden" name="acT" value="Dev" />
    

    After these changes, the input is valid XML.
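
    A quick way to check this is to feed both variants of the line to encoding/xml and look at the returned error. This is only a sketch: the helper parses and the throwaway target struct are illustrative, and the exact error text may differ between Go versions.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// parses reports whether encoding/xml (in its default strict mode)
// accepts s. The target struct is a throwaway; only the error matters.
func parses(s string) error {
	var v struct {
		XMLName xml.Name
	}
	return xml.Unmarshal([]byte(s), &v)
}

func main() {
	// Unquoted attribute values and an unclosed <input>: not XML.
	original := `<TD><input type=hidden name=acT value="Dev"></TD>`
	// Quoted attribute values and a self-closing <input />: valid XML.
	fixed := `<TD><input type="hidden" name="acT" value="Dev" /></TD>`

	fmt.Println("original:", parses(original)) // a non-nil syntax error
	fmt.Println("fixed:   ", parses(fixed))    // <nil>
}
```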

    How to model it? Simple as this:

    type XMLTable struct {
        Rows []struct {
            Cell string `xml:",innerxml"`
        } `xml:"TR>TD"`
    }
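
    In this struct, the tag TR>TD collects every <TD> found under a <TR> into one flat slice, and ,innerxml stores the raw, unparsed content of each element. If you also need to know which row a cell belongs to, a nested variant could look like the sketch below. The names Table, Rows, Cells, Inner and parseTable are hypothetical, not part of the code above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Table is a hypothetical variant that keeps cells grouped by row.
type Table struct {
	Rows []struct {
		Cells []struct {
			// Raw content of the <TD>, nested markup included.
			Inner string `xml:",innerxml"`
		} `xml:"TD"`
	} `xml:"TR"`
}

// parseTable unmarshals well-formed XML into a Table.
func parseTable(s string) (Table, error) {
	var tbl Table
	err := xml.Unmarshal([]byte(s), &tbl)
	return tbl, err
}

func main() {
	tbl, err := parseTable(`<TABLE><TR><TD>a</TD><TD><b>x</b></TD></TR><TR><TD>c</TD></TR></TABLE>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, row := range tbl.Rows {
		for j, cell := range row.Cells {
			fmt.Printf("row %d, cell %d: %q\n", i, j, cell.Inner)
		}
	}
}
```

    With the two-row sample input, this prints "a" and "<b>x</b>" for row 0 and "c" for row 1.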
    

    And the full code to parse and print contents of <TD> elements:

    raw_html_table := `
    <TABLE><TR>
    <TD>lalalal</TD>
    <TD>papapap</TD>
    <TD>fafafa</TD>
    <TD>
    <form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
    <input type="hidden" name="acT" value="Dev" />
    <input type="hidden" name="acA" value="Anyval" />
    <input type="submit" name="submit" value="Stop" />
    </form>
    </TD>
    </TR>
    </TABLE>`
    
    table := XMLTable{}
    err := xml.Unmarshal([]byte(raw_html_table), &table)
    if err != nil {
        fmt.Printf("error: %v\n", err)
    }
    
    fmt.Println("count:", len(table.Rows))
    for _, row := range table.Rows {
        fmt.Println("TD content:", row.Cell)
    }
    

    Output:

    count: 4
    TD content: lalalal
    TD content: papapap
    TD content: fafafa
    TD content: 
        <form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
        <input type="hidden" name="acT" value="Dev" />
        <input type="hidden" name="acA" value="Anyval" />
        <input type="submit" name="submit" value="Stop" />
        </form>
    

    Using a proper HTML parser

    If you can't or don't want to change the input HTML, or you want to handle arbitrary HTML input rather than only well-formed XML, you should use a proper HTML parser instead of treating the input as XML.

    Check out https://godoc.org/golang.org/x/net/html for an HTML5-compliant tokenizer and parser.
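
    If pulling in that package is not an option, a possible stdlib-only middle ground is the non-strict mode of xml.Decoder. This is not the HTML parser recommended above, and it is still not an HTML5 parser, so arbitrary real-world pages may fail. The encoding/xml documentation describes Strict = false together with xml.HTMLAutoClose and xml.HTMLEntity as a setup for typical HTML. The sketch below assumes the stray backslashes have already been removed from the input; Cells and parseLenient are illustrative names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Cells mirrors the struct from the answer: one entry per <TD>.
type Cells struct {
	Items []struct {
		Inner string `xml:",innerxml"`
	} `xml:"TR>TD"`
}

// parseLenient decodes HTML-like markup using the non-strict mode of
// encoding/xml, which tolerates simple unquoted attribute values and
// auto-closes the elements listed in xml.HTMLAutoClose (e.g. input).
func parseLenient(s string) (Cells, error) {
	d := xml.NewDecoder(strings.NewReader(s))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var c Cells
	err := d.Decode(&c)
	return c, err
}

func main() {
	input := `<TABLE><TR>
<TD>lalalal</TD>
<TD>
<form action="/addedUrl/" method=POST>
<input type=hidden name=acT value="Dev">
<input type=submit name=submit value=Stop>
</form>
</TD>
</TR></TABLE>`

	c, err := parseLenient(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("count:", len(c.Items))
	for _, item := range c.Items {
		fmt.Printf("TD content: %q\n", item.Inner)
	}
}
```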
