dongzhi6905 2014-09-16 13:39

Trimming the Freebase data dump to only English entities

I have a compressed Freebase data dump that contains all the entities. How can I use grep, or some other tool, to trim the dump so that it contains only English entities?

Here is what I am trying to get the RDF dump to look like: http://play.golang.org/p/-WwSysL3y3

<card>
    <title></title>
    <image></image>
    <text></text>
    <facts>
        <fact></fact>
        <fact></fact>
        <fact></fact>
    </facts>
</card>

Here, card is one entity, with content in each of the child elements. Title is the /type/object/name. Image is the image URL for the mid of the topic, built from the format string "https://usercontent.googleapis.com/freebase/v1/image%s" and the id. Text is the /common/document/text for the entity. Facts and its fact children hold facts such as age, birth date, and height, that is, the facts that show up in the knowledge panels in search.

Here is my attempt, in Go (Golang), to parse the RDF into XML like this. I'd appreciate it if someone could help me get the RDF into this form.

Here is the algorithm or logic of what I am trying to do:

For every entity written in English:

    parse the `type/object/name` property and write its value to the `<title></title>` element of the XML file.

    parse the mid, append it to `https://usercontent.googleapis.com/freebase/v1/image`, and write the result to the `<image></image>` element.

    parse the `common/document/text` property and write its value to the `<text></text>` element.

    And lastly, write each fact about the entity to a `<fact></fact>` element; these are all children of the `<facts></facts>` element.
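
To make the target format concrete, here is a minimal Go sketch of the output side of the steps above, using the standard `encoding/xml` package. It assumes the title, text, and facts have already been extracted from the dump and that the mid is already in `/m/...` form. The names `Card`, `NewCard`, and `RenderCard`, as well as the sample mid `/m/0example`, are hypothetical and only for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Card mirrors the <card> layout from the question.
// The `facts>fact` tag nests each fact inside a single <facts> element.
type Card struct {
	XMLName xml.Name `xml:"card"`
	Title   string   `xml:"title"`
	Image   string   `xml:"image"`
	Text    string   `xml:"text"`
	Facts   []string `xml:"facts>fact"`
}

// imageBase is the image endpoint quoted in the question.
const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// NewCard builds a card from already-extracted values.
// Assumption: mid is in path form, e.g. "/m/0example".
func NewCard(mid, title, text string, facts []string) Card {
	return Card{Title: title, Image: imageBase + mid, Text: text, Facts: facts}
}

// RenderCard returns the indented XML for one card.
func RenderCard(c Card) (string, error) {
	out, err := xml.MarshalIndent(c, "", "    ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Placeholder values, not real Freebase data.
	c := NewCard("/m/0example", "Example Topic", "A short description.",
		[]string{"fact one", "fact two"})
	s, err := RenderCard(c)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(s)
}
```

Extracting the values from the RDF triples is a separate step that this sketch does not cover.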

1 answer

  • drrqwokuz71031449 2014-09-18 17:22

    I agree with Joshua Taylor that the question is difficult to decipher, because "entity" is usually a synonym for a Freebase object, which may have labels in multiple languages (or no labels/text at all).

    If we recast the question as something along the lines of "How do I filter all non-English text from the compressed Freebase dump?", it becomes something that we can actually answer.

    In RDF, all strings are labeled with their language, so if we see something like

    ns:award.award_winner   rdfs:label      "Lauréat"@fr.
    

    We can tell that Lauréat is the French name for the Freebase type called Award Winner in English.

    To filter out non-English labels, use zgrep to drop the lines which match "@... (any language tag) but not "@en. This leaves all the types, properties, numbers, and English labels/descriptions, but it won't exclude objects which don't have at least one English label (another possible interpretation of your question). For that level of filtering, you'll probably need something more powerful than grep.

    This answer was selected by the asker as the best answer.
