duanchu0031 2019-02-02 05:35 Acceptance rate: 0%

XML parsing returns a string with newline characters

I am trying to parse a sitemap index (XML) in Go and then loop over the addresses it lists to fetch the details of each post. But I am getting this strange error:

: first path segment in URL cannot contain colon

This is the code snippet:

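package main

import (
    "encoding/xml"
    "fmt"
    "io/ioutil"
    "net/http"
)

// SitemapIndex maps the <sitemap> entries of the index document.
type SitemapIndex struct {
    Locations []Location `xml:"sitemap"`
}

// Location holds the text content of one <loc> element.
type Location struct {
    Loc string `xml:"loc"`
}

func (l Location) String() string {
    // Return the field directly; passing it to Sprintf would treat it as a format string.
    return l.Loc
}

func main() {
    // Errors are ignored here for brevity.
    resp, _ := http.Get("https://www.washingtonpost.com/news-sitemaps/index.xml")
    bytes, _ := ioutil.ReadAll(resp.Body)
    var s SitemapIndex
    xml.Unmarshal(bytes, &s)
    for _, Location := range s.Locations {
        fmt.Printf("Location: %s", Location.Loc)
        resp, err := http.Get(Location.Loc)
        fmt.Println("resp", resp)
        fmt.Println("err", err)
    }
}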

And the output:

Location: 
https://www.washingtonpost.com/news-sitemaps/politics.xml
resp <nil>
err parse 
https://www.washingtonpost.com/news-sitemaps/politics.xml
: first path segment in URL cannot contain colon
Location: 
https://www.washingtonpost.com/news-sitemaps/opinions.xml
resp <nil>
err parse 
https://www.washingtonpost.com/news-sitemaps/opinions.xml
: first path segment in URL cannot contain colon
...
...

My guess is that Location.Loc contains a newline before and after the actual address, e.g. "\nhttps://www.washingtonpost.com/news-sitemaps/politics.xml\n".

I suspect this because hardcoding the URL works as expected:

for _, Location := range s.Locations {
    fmt.Printf("Location: %s", Location.Loc)
    test := "https://www.washingtonpost.com/news-sitemaps/politics.xml"
    resp, err := http.Get(test)
    fmt.Println("resp", resp)
    fmt.Println("err", err)
}

The output is below; as you can see, the error is nil:

Location: 
https://www.washingtonpost.com/news-sitemaps/politics.xml
resp &{200 OK 200 HTTP/2.0 2 0 map[Server:[nginx] Arc-Service:[api] Arc-Org-Name:[washpost] Expires:[Sat, 02 Feb 2019 05:32:38 GMT] Content-Security-Policy:[upgrade-insecure-requests] Arc-Deployment:[washpost] Arc-Organization:[washpost] Cache-Control:[private, max-age=60] Arc-Context:[index] Arc-Application:[Feeds] Vary:[Accept-Encoding] Content-Type:[text/xml; charset=utf-8] Arc-Servername:[api.washpost.arcpublishing.com] Arc-Environment:[index] Arc-Org-Env:[washpost] Arc-Route:[/feeds] Date:[Sat, 02 Feb 2019 05:31:38 GMT]] 0xc000112870 -1 [] false true map[] 0xc00017c200 0xc0000ca370}
err <nil>
Location: 
...
...

But I am very new to Go, and so I have no idea what's wrong. Could you please tell me where I am wrong?

2 answers

  • drra6593 2019-02-02 07:11

    You are right: the issue comes from the newlines. Your Printf format string contains no newline, yet the output shows one before and one after each URL, so they must be part of Location.Loc itself.
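    One way to confirm this kind of guess is to print the value with the %q verb, which renders invisible characters as escape sequences. The sketch below is only an illustration: the quoted helper and the hardcoded loc value are assumptions standing in for what the decoder puts in Location.Loc.

```go
package main

import "fmt"

// quoted returns s as a double-quoted Go string literal, so that
// whitespace such as a newline shows up as a visible escape sequence.
// (Hypothetical helper, not part of the question's code.)
func quoted(s string) string {
	return fmt.Sprintf("%q", s)
}

func main() {
	// Assumed value, shaped like the content of Location.Loc in the question.
	loc := "\nhttps://www.washingtonpost.com/news-sitemaps/politics.xml\n"

	fmt.Println("Location:", quoted(loc))
}
```

    This prints Location: "\nhttps://www.washingtonpost.com/news-sitemaps/politics.xml\n", which makes the leading and trailing newline visible.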

    You can use strings.Trim to remove those newlines. Here is an example working with the sitemap that you are trying to parse. Once the string is trimmed, you will be able to call http.Get on it without any errors.

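    // Uses the SitemapIndex and Location types from the question,
    // plus the "strings" package in the import list.
    func main() {
        resp, _ := http.Get("https://www.washingtonpost.com/news-sitemaps/index.xml")
        bytes, _ := ioutil.ReadAll(resp.Body)

        var s SitemapIndex
        xml.Unmarshal(bytes, &s)

        for _, Location := range s.Locations {
            // Strip the newlines that surround the URL inside <loc>.
            loc := strings.Trim(Location.Loc, "\n")
            fmt.Printf("Location: %s\n", loc)
        }
    }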
    

    This code properly outputs the locations without any newlines, as expected:

    Location: https://www.washingtonpost.com/news-sitemaps/politics.xml
    Location: https://www.washingtonpost.com/news-sitemaps/opinions.xml
    Location: https://www.washingtonpost.com/news-sitemaps/local.xml
    Location: https://www.washingtonpost.com/news-sitemaps/sports.xml
    Location: https://www.washingtonpost.com/news-sitemaps/national.xml
    Location: https://www.washingtonpost.com/news-sitemaps/world.xml
    Location: https://www.washingtonpost.com/news-sitemaps/business.xml
    Location: https://www.washingtonpost.com/news-sitemaps/technology.xml
    Location: https://www.washingtonpost.com/news-sitemaps/lifestyle.xml
    Location: https://www.washingtonpost.com/news-sitemaps/entertainment.xml
    Location: https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml
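
    If you want to guard against this before calling http.Get, you could trim and validate each value first. The following is a minimal sketch, not a definitive implementation: cleanURL is a hypothetical helper, and it uses strings.TrimSpace instead of strings.Trim(..., "\n") so that spaces, tabs and carriage returns are removed as well.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// cleanURL trims surrounding whitespace from raw and checks that the
// result parses as an absolute URL (scheme and host present).
// Hypothetical helper, introduced here only for illustration.
func cleanURL(raw string) (string, error) {
	trimmed := strings.TrimSpace(raw)
	u, err := url.Parse(trimmed)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", trimmed)
	}
	return trimmed, nil
}

func main() {
	// Assumed value, shaped like the untrimmed Location.Loc.
	raw := "\nhttps://www.washingtonpost.com/news-sitemaps/politics.xml\n"

	// Parsing the untrimmed value fails; the wording of the error
	// may differ between Go versions.
	_, err := url.Parse(raw)
	fmt.Println("untrimmed parse failed:", err != nil)

	loc, err := cleanURL(raw)
	fmt.Println(loc, err)
}
```

    The value returned by cleanURL can then be passed to http.Get in place of Location.Loc.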
    

    The newlines end up in the Location.Loc field because of how the XML returned by this URL is laid out. Each entry has this form:

    <sitemap>
    <loc>
    https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml
    </loc>
    </sitemap>
    

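    As you can see, there are newlines before and after the content of each loc element. encoding/xml copies an element's character data into a string field as-is, whitespace included, which is how the newlines reach Location.Loc.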
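
    This behaviour can be reproduced without any network access. The sketch below decodes an inline document laid out like the entries above and trims each value in one pass. The sample constant, its example.com URLs, and the parseIndex helper are assumptions made for illustration, and strings.TrimSpace is used as a slightly broader substitute for strings.Trim.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type SitemapIndex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

// sample mimics the layout of the real index; the URLs are placeholders.
const sample = `<sitemapindex>
<sitemap>
<loc>
https://example.com/news-sitemaps/politics.xml
</loc>
</sitemap>
<sitemap>
<loc>
https://example.com/news-sitemaps/opinions.xml
</loc>
</sitemap>
</sitemapindex>`

// parseIndex decodes data and returns every <loc> value with the
// surrounding whitespace removed. (Hypothetical helper.)
func parseIndex(data []byte) ([]string, error) {
	var s SitemapIndex
	if err := xml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(s.Locations))
	for _, l := range s.Locations {
		locs = append(locs, strings.TrimSpace(l.Loc))
	}
	return locs, nil
}

func main() {
	// Decode once without trimming to show what the field really holds.
	var raw SitemapIndex
	if err := xml.Unmarshal([]byte(sample), &raw); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("as decoded: %q\n", raw.Locations[0].Loc)

	locs, err := parseIndex([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, loc := range locs {
		fmt.Printf("Location: %s\n", loc)
	}
}
```

    Trimming inside a helper like parseIndex keeps the rest of the program free of whitespace handling, so the values can go straight to http.Get.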

    This answer was accepted by the asker as the best answer.