doucheng4094 2014-10-15 19:36

Parsing RDF triples with Go: some items incorrectly pass the regexp checks

When parsing the Freebase RDF data dump, I'm trying to parse only certain entities, selected by their titles and text. I'm using regexps to match the titles and text, and even though the regexps do not match (they return false), the content still gets through.

I decide what to turn into XML as follows: properties["/type/object/name"] must be non-empty and contain @en, and properties["/common/document/text"] must be non-empty.

What defines empty? By printing all the names ( properties["/type/object/name"] ) and texts ( properties["/common/document/text"] ), I noticed that some of them are just "[]". I don't want those. What I do want are the entities whose name ( properties["/type/object/name"] ) is not "[]" and contains @en. The text ( properties["/common/document/text"] ) won't have the @en, so if it is not "[]" and its corresponding name has @en, then that entity should be converted to XML.

When I run my code, which uses regexps to check for those conditions, the checks seem to be ignored and those 'empty entities' are still converted to XML.

Here is some output I grabbed from the terminal:

<card>
<title>"[]"</title>
<image>"https://usercontent.googleapis.com/freebase/v1/image"</image>
%!(EXTRA string=/american_football/football_player/footballdb_id)<text>"[]"</text>
<facts>
    <fact property="/type/object/type">/type/property</fact>
    <fact property="/type/property/schema">/american_football/football_player</fact>
    <fact property="/type/property/unique">true</fact>
    <fact property="http://www/w3/org/2000/01/rdf-schema#label">"footballdb ID"@en</fact>
    <fact property="/type/property/expected_type">/type/enumeration</fact>
    <fact property="http://www/w3/org/1999/02/22-rdf-syntax-ns#type">http://www/w3/org/2002/07/owl#FunctionalProperty</fact>

    <fact property="http://www/w3/org/2000/01/rdf-schema#domain">/american_football/football_player</fact>

    <fact property="http://www/w3/org/2000/01/rdf-schema#range">/type/enumeration</fact>
 </facts>
 </card>

Here is my code. What am I doing wrong? Shouldn't the regexps match and prevent that card from being written?

func validTitle(content []string) bool{
    for _, v := range content{
         emptyTitle, _ := regexp.MatchString("\"[]\"", v)
         validTitle, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
         englishTitle, _ := regexp.MatchString("@en", v)
         if (!validTitle || !englishTitle) && !emptyTitle{
              return false
         }
    }
    return true 
 }

 func validText(content []string) bool{
      for _, v := range content{
          emptyTitle, _ := regexp.MatchString("\"[]\"", v)
          validText, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
          if !validText && !emptyTitle{
             return false
          }
      }
      return true
 }

 func processTopic(id string, properties map[string][]string, file io.Writer){
      if validTitle(properties["/type/object/name"]) && validText(properties["/common/document/text"]){
           fmt.Fprintf(file, "<card>\n")
           fmt.Fprintf(file, "<title>\"%s\"</title>\n", properties["/type/object/name"])
           fmt.Fprintf(file, "<image>\"%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
           fmt.Fprintf(file, "<text>\"%s\"</text>\n", properties["/common/document/text"])
           fmt.Fprintf(file, "<facts>\n")
           for k, v := range properties{
                for _, value := range v{
                    fmt.Fprintf(file, "<fact property=\"%s\">%s</fact>\n", k, value)
                }
           }
           fmt.Fprintf(file, "</facts>\n")
           fmt.Fprintf(file, "</card>\n")
      }
 }

  • dongya0914 2014-10-15 22:02

    Your regexp is invalid. If you check the error, it will tell you exactly why:

    error parsing regexp: missing closing ]: `[]"`
    
    regexp.MatchString("\"[]\"", v)
    // should be
    regexp.MatchString(`"\[\]"`, v)
    

    Also, since you use each regexp multiple times, you should compile it once outside the function and reuse it, for example:

    var (
        emptyRe   = regexp.MustCompile(`"\[\]"`)
        titleRe   = regexp.MustCompile("^[A-Za-z0-9][A-Za-z0-9_-]*$")
        englishRe = regexp.MustCompile("@en")
    )
    
    func validTitle(content []string) bool {
        for _, v := range content {
            if emptyRe.MatchString(v) || !(englishRe.MatchString(v) || titleRe.MatchString(v)) {
                return false
            }
        }
        return true
    }
    

    This line's format string expects one value, but you're giving it two:

    fmt.Fprintf(file, "<image>\"%s\"</image>\n",
                "https://usercontent.googleapis.com/freebase/v1/image", // this matches the %s
                 id, // this doesn't
    )
    

    It should be

    fmt.Fprintf(file, "<image>\"%s/%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
    
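
The surplus argument is what produces the `%!(EXTRA string=...)` text in the question's output. The following sketch reproduces it with `fmt.Sprintf`; `brokenTag` and `imageTag` are hypothetical helpers. Because the id in the sample output already begins with a slash, `imageTag` uses `%s%s` rather than `%s/%s` to avoid a double slash. That is an assumption about the ids, not something the source confirms for every entity.

```go
package main

import "fmt"

const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// brokenTag (hypothetical) repeats the mismatch: one verb, two arguments.
// The format is kept in a variable so that go vet does not reject the
// deliberate mistake at build time.
func brokenTag(id string) string {
	format := "<image>\"%s\"</image>\n"
	return fmt.Sprintf(format, imageBase, id)
}

// imageTag (hypothetical) gives every argument its own verb.
// It assumes id starts with "/", as in the question's sample output.
func imageTag(id string) string {
	return fmt.Sprintf("<image>\"%s%s\"</image>\n", imageBase, id)
}

func main() {
	id := "/american_football/football_player/footballdb_id"
	// The unused argument is appended after the formatted text as %!(EXTRA string=...).
	fmt.Print(brokenTag(id))
	fmt.Println()
	fmt.Print(imageTag(id))
}
```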