Information Extraction

Problem Description
Extracting information from an HTML document is a complex task. For example, journalists often need to extract news from a website (including the title, content, time, etc.) and then convert it into a specific format. Because different HTML documents have different structures and copying and pasting by hand is too tedious, you need to write a program that stores a specific output format, the structures of some HTML documents, and the mapping from each structure to the output format. Given an HTML document as input, the program must output text in the specific format according to the mapping.
HTML documents are defined as follows:
HTML
　　HTML stands for HyperText Markup Language.
　　HTML is a markup language.
　　A markup language is a set of markup tags.
　　The tags describe document content.
　　HTML documents consist of tags and texts.
Tags
　　HTML uses tags for its syntax.
　　A tag is composed of the special characters ‘<’, ‘>’ and ‘/’.
　　Tags usually come in pairs, the opening tag and the closing tag.
　　The opening tag starts with “<” and the tagname. It usually ends with a “>”.
　　The closing tag starts with “</” and the same tagname as the corresponding opening tag. It ends with a “>”.
　　There will not be any other angle brackets in the documents.
　　Tagnames are strings containing only lowercase letters.
　　Tags will contain no line break (‘\n’).
　　Except for tags, anything that occurs in the document is considered text content.
　　The length of a tagname is less than or equal to 30.
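
The tag rules above can be sketched as a small tokenizer. This is only an illustrative sketch in Python, not part of the problem statement: the function name `tokenize` and the token kinds are assumptions. It relies on the guarantees that no stray angle brackets occur and that tags contain no line break.

```python
import re

# A tag is everything between '<' and '>'. The statement guarantees that no
# other angle brackets appear and that tags never contain a line break.
TAG_RE = re.compile(r"<[^<>\n]*>")


def tokenize(document):
    """Split a document into (kind, value) tokens.

    kind is 'open', 'close', 'selfclose' or 'text' (names chosen for this sketch).
    For tags, value is the text inside the brackets without '<', '>', '/'
    delimiters, so an opening tag still carries its attributes.
    """
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(document):
        if match.start() > pos:
            # Anything between two tags is text content.
            tokens.append(("text", document[pos:match.start()]))
        tag = match.group()
        if tag.startswith("</"):
            tokens.append(("close", tag[2:-1]))
        elif tag.endswith("/>"):
            tokens.append(("selfclose", tag[1:-2]))
        else:
            tokens.append(("open", tag[1:-1]))
        pos = match.end()
    if pos < len(document):
        tokens.append(("text", document[pos:]))
    return tokens
```

Text tokens keep their whitespace and line breaks unchanged, since the statement treats everything outside tags as text content.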
Elements
　　An element is everything from an opening tag to the matching closing tag (including the two tags).
　　The element content is everything between the opening and the closing tag.
　　Some elements may have no content. They’re called empty elements, like

.
　　Empty elements can be closed in the opening tag, ending with a “/>” instead of “>”.
　　All elements are closed either with a closing tag or in the opening tag.
　　Elements can have attributes.
　　Elements can be nested (can contain other elements).
　　The element is the container for all other elements, it will not have any attributes.
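
Because elements nest and every element is closed, a document can be turned into a tree with a simple stack. The sketch below is one possible approach, not a reference solution: `build_tree` and the dictionary layout of a node are assumptions made for illustration.

```python
import re

# One alternative matches a tag (optional '/', lowercase name, raw attribute
# text, optional trailing '/'); the other matches a run of text content.
TOKEN_RE = re.compile(r"<(/?)([a-z]+)([^<>]*?)(/?)>|([^<>]+)")


def build_tree(document):
    """Return the list of top-level nodes of the document.

    Each element is a dict with 'tag', 'attrs' (raw attribute text) and
    'children'; text content is stored as plain strings among the children.
    """
    root = {"tag": None, "attrs": "", "children": []}
    stack = [root]  # stack[-1] is the element currently being filled
    for closing, name, attrs, selfclosing, text in TOKEN_RE.findall(document):
        if text:
            stack[-1]["children"].append(text)
        elif closing:
            if len(stack) == 1 or stack[-1]["tag"] != name:
                raise ValueError(f"unexpected closing tag </{name}>")
            stack.pop()
        else:
            node = {"tag": name, "attrs": attrs.strip(), "children": []}
            stack[-1]["children"].append(node)
            if not selfclosing:
                # Only elements with a separate closing tag can hold content.
                stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed element")
    return root["children"]
```

An element closed in its opening tag is added as a leaf and never pushed, which is why it ends up with no children.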
Attributes
　　Attributes provide additional information about an element.
　　Attributes are always specified in the opening tag after the tagname.
　　The tag name and the attributes are separated by a single space.
　　An element may have several attributes.
　　Attributes come in name="value" pairs like class="icpc".
　　There will not be any space around the '='.
　　All attribute names are in lowercase.
　　The value of the id attribute is unique, and its length is less than or equal to 30.
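
Given these rules, the inside of an opening tag can be split into a tagname and its attributes. The following Python sketch is illustrative only: `parse_opening_tag` is a hypothetical helper, and it assumes that attribute values contain no double quote, which the statement does not say explicitly.

```python
import re

# name="value" with no space around '=' (guaranteed by the statement).
# Assumption for this sketch: values never contain a double quote.
ATTR_RE = re.compile(r'([^\s="]+)="([^"]*)"')


def parse_opening_tag(inner):
    """Parse the text between '<' and '>' of an opening tag.

    Returns (tagname, attributes) where attributes is a dict.
    """
    # The tagname ends at the first space; everything after it is attributes.
    name, _, rest = inner.partition(" ")
    return name, dict(ATTR_RE.findall(rest))
```

For example, `parse_opening_tag('div id="header" class="icpc"')` yields the tagname `div` and a dictionary holding the `id` and `class` values.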
A Simple Example

this is a test

this is content

var x = 1111;

The structure of an HTML document is the HTML document itself, except that its elements carry only the id attribute and have no text content. An HTML document may have many structures, because the content of some elements may be removed.
A Simple Example

The specific output format looks like an HTML document, but the container for all other elements is not necessarily the element.
A Simple Example

others

The mapping from the structure to the output format is defined as follows:
The value of an id attribute of a structure-A tagname of the output format
(This means that the content of the element to which the id belongs becomes the content of the elements in the output format that have that tagname.)
Two Simple Examples
header-title
content-content
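
A mapping line such as the two above can be split into its id and tagname. This is a minimal sketch with a hypothetical helper name, `parse_mapping`. It splits at the last hyphen because tagnames contain only lowercase letters; whether an id value may itself contain a hyphen is not stated, so handling that case is an assumption.

```python
def parse_mapping(line):
    """Split a mapping line 'id-tagname' into (id, tagname).

    The split is made at the LAST '-': a tagname consists of lowercase
    letters only, so it cannot contain a hyphen, while an id value might
    (an assumption; the statement does not restrict id characters).
    """
    element_id, _, tagname = line.strip().rpartition("-")
    return element_id, tagname
```

Collecting the M parsed pairs into a dictionary keyed by id gives a direct lookup from a structure element to the output tagname it fills.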

Input
The first line of the input is an integer T (T<=15), the number of test cases.
Each test case begins with a specific output format (length less than or equal to 10000).
Next comes an integer N (0<=N<=30), the number of types of HTML document.
Each type is given as the structure of an HTML document.
Next comes an integer M (0<=M<=30), the number of mappings from the structures to the specific output format. Each of the next M lines is a mapping.
Each test case ends with an HTML document (length less than or equal to 10000).

Output
For each test case, first output a line “Case #x:”, where x is the case number (starting from 1). If a structure matching the HTML document exists, output the text in the specific format; otherwise output “Can't Identify”. If more than one structure matches the HTML document, use the one that appears earliest in the input.
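
The selection rule can be sketched as follows. The function `answer_case` and its two callbacks, `matches` and `render`, are hypothetical placeholders: how a structure is matched against a document and how the output text is produced are left open here.

```python
def answer_case(case_number, structures, matches, render):
    """Build the output for one test case.

    structures: the N structures in input order.
    matches(structure): hypothetical predicate, True if the structure fits
        the HTML document of this test case.
    render(structure): hypothetical function returning the filled-in
        output format text for that structure.
    """
    header = f"Case #{case_number}:"
    for structure in structures:  # input order, so the earliest match wins
        if matches(structure):
            return f"{header}\n{render(structure)}"
    return f"{header}\nCan't Identify"
```

With N = 0 the loop body never runs, so the case falls through to “Can't Identify”.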

Sample Input
2

default title

1

2
header-title
content-content

this is a test

this is content

var x = 1111;

default title

1

1
header-title

xxxx

Sample Output
Case #1:

this is a test

this is content

var x = 1111;

Case #2:
Can't Identify
