When parsing the Freebase RDF data dump, I'm trying to convert only certain entities, based on their titles and text. I'm using regexps to match the titles and text, but even when the regexps don't match (they return false), the content still gets through.
This is how I decide what to turn into XML: properties["/type/object/name"] must not be empty and must contain @en, and properties["/common/document/text"] must not be empty.
What defines empty? By printing all the names ( properties["/type/object/name"] ) and text ( properties["/common/document/text"] ), I noticed that some of them are just "[]". I don't want those. What I do want are the ones that are not "[]" and contain @en in the name ( properties["/type/object/name"] ). The text ( properties["/common/document/text"] ) won't have the @en, so if the text is not "[]" and its corresponding name has @en, then that entity should be converted to XML.
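The rule described above can be written as a plain predicate without any regexps. This is only a minimal sketch: it assumes the property values arrive as a []string (as in the processTopic signature) and that "[]" is what an empty slice looks like when printed. The helper name wantEntity is hypothetical and is not part of my code.

```go
package main

import (
	"fmt"
	"strings"
)

// wantEntity is a hypothetical helper expressing the rule from the text:
// the name must be non-empty and have an @en value, and the text must be non-empty.
func wantEntity(names, texts []string) bool {
	// An empty slice is assumed to be what prints as "[]".
	if len(names) == 0 || len(texts) == 0 {
		return false
	}
	// At least one name value must carry the English language tag.
	for _, n := range names {
		if strings.Contains(n, "@en") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(wantEntity(nil, nil))                                             // false
	fmt.Println(wantEntity([]string{`"footballdb ID"@en`}, []string{"some text"})) // true
	fmt.Println(wantEntity([]string{`"Nom"@fr`}, []string{"some text"}))          // false
}
```

The sketch accepts an entity as soon as one name value contains @en; requiring every value to be English would be a different rule.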
When I run my code, the regexp checks seem to be ignored, and those "empty" entities are still being converted to XML.
Here is some output I grabbed from the terminal:
<card>
<title>"[]"</title>
<image>"https://usercontent.googleapis.com/freebase/v1/image"</image>
%!(EXTRA string=/american_football/football_player/footballdb_id)<text>"[]"</text>
<facts>
<fact property="/type/object/type">/type/property</fact>
<fact property="/type/property/schema">/american_football/football_player</fact>
<fact property="/type/property/unique">true</fact>
<fact property="http://www/w3/org/2000/01/rdf-schema#label">"footballdb ID"@en</fact>
<fact property="/type/property/expected_type">/type/enumeration</fact>
<fact property="http://www/w3/org/1999/02/22-rdf-syntax-ns#type">http://www/w3/org/2002/07/owl#FunctionalProperty</fact>
<fact property="http://www/w3/org/2000/01/rdf-schema#domain">/american_football/football_player</fact>
<fact property="http://www/w3/org/2000/01/rdf-schema#range">/type/enumeration</fact>
</facts>
</card>
Here is my code. What am I doing wrong? Shouldn't the regexps match and prevent this entity from being written?
func validTitle(content []string) bool {
	for _, v := range content {
		emptyTitle, _ := regexp.MatchString("\"[]\"", v)
		validTitle, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
		englishTitle, _ := regexp.MatchString("@en", v)
		if (!validTitle || !englishTitle) && !emptyTitle {
			return false
		}
	}
	return true
}
func validText(content []string) bool {
	for _, v := range content {
		emptyTitle, _ := regexp.MatchString("\"[]\"", v)
		validText, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
		if !validText && !emptyTitle {
			return false
		}
	}
	return true
}
func processTopic(id string, properties map[string][]string, file io.Writer) {
	if validTitle(properties["/type/object/name"]) && validText(properties["/common/document/text"]) {
		fmt.Fprintf(file, "<card>\n")
		fmt.Fprintf(file, "<title>\"%s\"</title>\n", properties["/type/object/name"])
		fmt.Fprintf(file, "<image>\"%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
		fmt.Fprintf(file, "<text>\"%s\"</text>\n", properties["/common/document/text"])
		fmt.Fprintf(file, "<facts>\n")
		for k, v := range properties {
			for _, value := range v {
				fmt.Fprintf(file, "<fact property=\"%s\">%s</fact>\n", k, value)
			}
		}
		fmt.Fprintf(file, "</facts>\n")
		fmt.Fprintf(file, "</card>\n")
	}
}