Trimming a Freebase data dump down to English-only entities

I have a compressed Freebase data dump that contains all of the entities. How can I use grep, or some other tool, to trim the dump so that it contains only English entities?

Here is what I am trying to get the RDF dump to look like: http://play.golang.org/p/-WwSysL3y3

<card>
    <title></title>
    <image></image>
    <text></text>
    <facts>
        <fact></fact>
        <fact></fact>
        <fact></fact>
    </facts>
</card>

Here, card is one entity, with content in every child element. Title is the /type/object/name. Image is the image for the mid of the topic, built from the format string "https://usercontent.googleapis.com/freebase/v1/image%s" and the id. Text is the /common/document/text for the entity. Facts, with its fact children, holds facts such as age, birth date, and height, that is, the facts that show up in the knowledge panels in search.
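The card layout above maps fairly directly onto Go's encoding/xml struct tags. The following is only a minimal sketch: the `Card` type, the `imageURL` and `render` helpers, and the sample id `/m/0abc` are hypothetical names introduced for illustration, and it assumes the id already starts with a slash.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Card mirrors the <card> layout from the question (hypothetical type name).
type Card struct {
	XMLName xml.Name `xml:"card"`
	Title   string   `xml:"title"`
	Image   string   `xml:"image"`
	Text    string   `xml:"text"`
	// "facts>fact" nests every fact inside a single <facts> parent element.
	Facts []string `xml:"facts>fact"`
}

// imageBase is the image endpoint quoted in the question.
const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// imageURL appends the id to the endpoint.
// Assumption: id already has a leading slash, e.g. "/m/0abc".
func imageURL(id string) string {
	return imageBase + id
}

// render serializes one card as indented XML.
func render(c Card) (string, error) {
	out, err := xml.MarshalIndent(c, "", "    ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// All values below are made-up sample data.
	c := Card{
		Title: "Example topic",
		Image: imageURL("/m/0abc"),
		Text:  "A short description of the topic.",
		Facts: []string{"height: 1.80", "date of birth: 1970-01-01"},
	}
	s, err := render(c)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s)
}
```

Keeping the facts as a plain string slice is a simplification; a struct with a name attribute per fact might suit real data better.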

Here is my attempt to parse the RDF into XML like this in Go (Golang). I'd appreciate it if someone could help me get the RDF into this form.

Here is the algorithm or logic of what I am trying to do:

For every entity written in English:

    Parse the `type/object/name` property and write its value to the XML file in the `<title></title>` element.

    Parse the mid, append it to `https://usercontent.googleapis.com/freebase/v1/image`, and write the result to the XML file in the `<image></image>` element.

    Parse the `common/document/text` property and write its value to the `<text></text>` element.

    Lastly, for each fact about the entity, write it to a `<fact></fact>` element in the XML file; these are all children of the `<facts></facts>` element.
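The steps above could be sketched in Go roughly as follows. This is an illustration under several assumptions that the question does not confirm: each line is a tab-separated `subject predicate object .` triple with IRIs in angle brackets under `http://rdf.freebase.com/ns/`, all triples for one subject are contiguous, predicates use the dotted form `type.object.name` and `common.document.text`, and every remaining triple counts as a "fact". The names `Card`, `buildCards`, and `buildCardsFromString` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// Assumed constants; adjust them to the actual dump.
const (
	nsPrefix  = "http://rdf.freebase.com/ns/"
	namePred  = "type.object.name"
	textPred  = "common.document.text"
	imageBase = "https://usercontent.googleapis.com/freebase/v1/image"
)

// Card collects the fields the question wants per entity.
type Card struct {
	Title, Image, Text string
	Facts              []string
}

// short strips the angle brackets and the namespace prefix from an IRI.
func short(term string) string {
	term = strings.TrimSuffix(strings.TrimPrefix(term, "<"), ">")
	return strings.TrimPrefix(term, nsPrefix)
}

// literal splits "text"@lang into text and lang; ok is false for non-literals.
func literal(term string) (text, lang string, ok bool) {
	end := strings.LastIndex(term, `"`)
	if !strings.HasPrefix(term, `"`) || end <= 0 {
		return "", "", false
	}
	text = term[1:end]
	if u, err := strconv.Unquote(term[:end+1]); err == nil {
		text = u // resolve escapes such as \" when Go's rules accept them
	}
	if rest := term[end+1:]; strings.HasPrefix(rest, "@") {
		lang = rest[1:]
	}
	return text, lang, true
}

// buildCards streams triples and emits one card per subject with an English name.
func buildCards(r io.Reader, emit func(Card)) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long description lines
	var cur Card
	var curSubj string
	flush := func() {
		if cur.Title != "" { // keep only entities that have an English name
			emit(cur)
		}
	}
	for sc.Scan() {
		f := strings.Split(sc.Text(), "\t")
		if len(f) < 3 {
			continue
		}
		subj, pred, obj := short(f[0]), short(f[1]), f[2]
		if subj != curSubj {
			flush()
			curSubj = subj
			// Assumption: mid "m.0abc" corresponds to image id "/m/0abc".
			cur = Card{Image: imageBase + "/" + strings.Replace(subj, ".", "/", 1)}
		}
		text, lang, isLit := literal(obj)
		switch {
		case isLit && lang != "" && lang != "en":
			// skip text in other languages
		case isLit && pred == namePred:
			cur.Title = text
		case isLit && pred == textPred:
			cur.Text = text
		case isLit:
			cur.Facts = append(cur.Facts, pred+": "+text)
		default:
			cur.Facts = append(cur.Facts, pred+": "+short(obj))
		}
	}
	flush()
	return sc.Err()
}

// buildCardsFromString is a convenience wrapper for small in-memory samples.
func buildCardsFromString(s string) ([]Card, error) {
	var cards []Card
	err := buildCards(strings.NewReader(s), func(c Card) { cards = append(cards, c) })
	return cards, err
}

func main() {
	// Made-up sample triples in the assumed format.
	sample := "<" + nsPrefix + "m.0abc>\t<" + nsPrefix + namePred + ">\t\"Example\"@en\t.\n" +
		"<" + nsPrefix + "m.0abc>\t<" + nsPrefix + namePred + ">\t\"Exemple\"@fr\t.\n"
	cards, err := buildCardsFromString(sample)
	fmt.Println(cards, err)
}
```

Because only one entity is held in memory at a time, this approach should scale to a large dump as long as the subject-grouping assumption holds; if it does not, the triples would need to be sorted or indexed first.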


drrqwokuz71031449 2014-09-18 17:22
I agree with Joshua Taylor that the question is difficult to decipher, because "entity" is usually a synonym for a Freebase object, which may have labels in multiple languages (or no labels/text at all).

If we recast the question as something along the lines of "How do I filter all non-English text from the compressed Freebase dump?", it becomes something that we can actually answer.

In RDF, strings carry a language tag, so if we see something like

ns:award.award_winner rdfs:label "Lauréat"@fr.

we can tell that Lauréat is the French name for the Freebase type called Award Winner in English.
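As a rough illustration of reading that tag programmatically, the Go sketch below pulls the language tag off a single RDF term. The function name `langTag` is hypothetical, and the parsing is deliberately simple: it looks for the last `"@` and checks that only letters, digits, and hyphens follow.

```go
package main

import (
	"fmt"
	"strings"
)

// langTag returns the language tag of a literal such as "Lauréat"@fr.
// ok is false for IRIs, prefixed names, numbers, and untagged or typed literals.
func langTag(term string) (tag string, ok bool) {
	term = strings.TrimSpace(term)
	i := strings.LastIndex(term, `"@`)
	if !strings.HasPrefix(term, `"`) || i <= 0 || i+2 >= len(term) {
		return "", false
	}
	tag = term[i+2:]
	for _, r := range tag {
		isAlnum := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
		if !isAlnum && r != '-' {
			// Something other than a tag follows, e.g. an escaped quote inside the text.
			return "", false
		}
	}
	return tag, true
}

func main() {
	for _, term := range []string{`"Lauréat"@fr`, `"Award Winner"@en`, `ns:award.award_winner`} {
		tag, ok := langTag(term)
		fmt.Printf("%-24s tag=%q tagged=%v\n", term, tag, ok)
	}
}
```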

To filter out non-English labels, use zgrep to drop the lines that match `"@...` (any language tag) but not `"@en`. That leaves all the types, properties, numbers, and English labels/descriptions, but it won't exclude objects that lack an English label (another possible interpretation of your question). That level of filtering probably needs something more powerful than grep.
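The same line-level filter can also be written in Go, the language the question uses. This is a sketch rather than a tested tool: `lineLang`, `keepLine`, `filterEnglish`, and `filterGzip` are hypothetical names, the sample lines are made up in the style of the example above, and treating regional variants such as `en-GB` as English is an assumption.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"io"
	"log"
	"os"
	"strings"
)

// lineLang extracts the language tag of the object literal on one triple line.
func lineLang(line string) (string, bool) {
	s := strings.TrimSpace(line)
	s = strings.TrimSpace(strings.TrimSuffix(s, ".")) // drop the statement terminator
	i := strings.LastIndex(s, `"@`)
	if i < 0 || i+2 >= len(s) {
		return "", false
	}
	tag := s[i+2:]
	for _, r := range tag {
		isAlnum := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
		if !isAlnum && r != '-' {
			return "", false
		}
	}
	return tag, true
}

// keepLine keeps untagged lines and lines tagged en (or en-*, by assumption).
func keepLine(line string) bool {
	tag, ok := lineLang(line)
	if !ok {
		return true
	}
	tag = strings.ToLower(tag)
	return tag == "en" || strings.HasPrefix(tag, "en-")
}

// filterEnglish copies the kept lines from r to w, one line at a time.
func filterEnglish(r io.Reader, w io.Writer) error {
	br := bufio.NewReader(r)
	bw := bufio.NewWriter(w)
	for {
		line, err := br.ReadString('\n')
		if line != "" && keepLine(line) {
			if _, werr := bw.WriteString(line); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
	}
	return bw.Flush()
}

// filterGzip applies the same filter to a gzip-compressed stream.
func filterGzip(r io.Reader, w io.Writer) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer zr.Close()
	return filterEnglish(zr, w)
}

func main() {
	// Made-up sample lines; for the real dump, pass the opened file to filterGzip.
	sample := "ns:award.award_winner rdfs:label \"Lauréat\"@fr.\n" +
		"ns:award.award_winner rdfs:label \"Award Winner\"@en.\n" +
		"ns:award.award_winner ns:type.object.type ns:type.type.\n"
	if err := filterEnglish(strings.NewReader(sample), os.Stdout); err != nil {
		log.Fatal(err)
	}
}
```

Like the grep approach, this works line by line, so it cannot drop objects that have no English label at all; that requires grouping lines by subject first.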
This answer was accepted by the asker as the best answer.

Question asked 2014-09-16 13:39; 1 answer, accepted.