Extracting data from a multi-level tagged XML structure

I am trying to extract data from a multi-level structured XML file. The input is the result of this query:

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24874852&retmode=xml&rettype=abstract&email=abc@xyz.com

Output of the query:

<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
    <PubmedArticle>
        <MedlineCitation Status="Publisher" Owner="NLM">
            <PMID Version="1">24874852</PMID>
            <DateCreated>
                <Year>2014</Year>
                <Month>5</Month>
                <Day>30</Day>
            </DateCreated>  
            <Article PubModel="Print-Electronic">
                <Journal> 
                    <ISSN IssnType="Electronic">1976-670X</ISSN>
                    <JournalIssue CitedMedium="Internet">
                        <PubDate>
                            <Year>2014</Year>
                            <Month>May</Month>
                            <Day>30</Day>
                        </PubDate>
                    </JournalIssue>
                    <Title>BMB reports</Title>
                    <ISOAbbreviation>BMB Rep</ISOAbbreviation>
                </Journal>
                <ArticleTitle>
                    Human selenium binding protein-1 (hSP56) is a negative regulator of HIF-1α and suppresses the malignant characteristics of prostate cancer cells.
                </ArticleTitle>
                <Pagination>
                    <MedlinePgn/>
                </Pagination>
                <ELocationID EIdType="pii">2831</ELocationID>
                <Abstract>
                    <AbstractText NlmCategory="UNLABELLED">
                        In the present study, we demonstrate that ectopic expression of 56-kDa human selenium binding protein-1 (hSP56) in PC-3 cells that do not normally express hSP56 results in a marked inhibition of cell growth in vitro and in vivo. Down-regulation of hSP56 in LNCaP cells that normally express hSP56 results in enhanced anchorage-independent growth. PC-3 cells expressing hSP56 exhibit a significant reduction of hypoxia inducible protein (HIF)-1α protein levels under hypoxic conditions without altering HIF-1α mRNA (HIF1A) levels. Taken together, our findings strongly suggest that hSP56 plays a critical role in prostate cells by mechanisms including negative regulation of HIF-1α, thus identifying hSP56 as a candidate anti-oncogene product.
                    </AbstractText>
                </Abstract>
                <AuthorList>
                    <Author>
                        <LastName>Jeong</LastName>
                        <ForeName>Jee-Yeong</ForeName>
                        <Initials>JY</Initials>
                        <Affiliation>
                            Laboratory for Cell and Molecular Biology, Division of Hematology and Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA; Department of Biochemistry and Cancer Research Institute, Kosin University College of Medicine, Busan, South Korea.
                        </Affiliation>
                    </Author>
                    <Author>
                        <LastName>Zhou</LastName>
                        <ForeName>Jin-Rong</ForeName>
                        <Initials>JR</Initials>
                    </Author>
                    <Author>
                        <LastName>Gao</LastName>
                        <ForeName>Chong</ForeName>
                        <Initials>C</Initials>
                    </Author>
                    <Author>
                        <LastName>Feldman</LastName>
                        <ForeName>Laurie</ForeName>
                        <Initials>L</Initials>
                    </Author>
                    <Author>
                        <LastName>Sytkowski</LastName>
                        <ForeName>Arthur J</ForeName>
                        <Initials>AJ</Initials>
                    </Author>
                </AuthorList>
                <Language>ENG</Language>
                <PublicationTypeList>
                    <PublicationType>JOURNAL ARTICLE</PublicationType>
                </PublicationTypeList>
                <ArticleDate DateType="Electronic">
                    <Year>2014</Year>
                    <Month>5</Month>
                    <Day>30</Day>
                </ArticleDate>
            </Article>
            <MedlineJournalInfo>
                <MedlineTA>BMB Rep</MedlineTA>
                <NlmUniqueID>101465334</NlmUniqueID>
                <ISSNLinking>1976-6696</ISSNLinking>
            </MedlineJournalInfo>
        </MedlineCitation>
        <PubmedData>
            <History>
                <PubMedPubDate PubStatus="entrez">
                    <Year>2014</Year>
                    <Month>5</Month>
                    <Day>31</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="pubmed">
                    <Year>2014</Year>
                    <Month>5</Month>
                    <Day>31</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="medline">
                    <Year>2014</Year>
                    <Month>5</Month>
                    <Day>31</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
            </History>
            <PublicationStatus>aheadofprint</PublicationStatus>
            <ArticleIdList>
                <ArticleId IdType="pii">2831</ArticleId>
                <ArticleId IdType="pubmed">24874852</ArticleId>
            </ArticleIdList>
        </PubmedData>
    </PubmedArticle>
</PubmedArticleSet>

My intention is to reorganise the data on another webpage, so I am trying to extract data from every layer of this structure. I am using regex. For example, to extract the abstract text from the XML structure, here is the code I am using:

$o=urlencode("24874852");
$efetch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
db=pubmed&id=$o&retmode=xml&rettype=abstract&email=abc@xyz.com";
#echo $efetch;
$handle1 = file_get_contents($efetch,"r");
#echo $handle1s;
preg_match_all('/<AbstractText>\s*([0-9A-Za-z\.\_
]+)\s*   
<\/AbstractText>/s',$handle1,$abstext,PREG_PATTERN_ORDER)
foreach ($abstext[1] as $tiab){
echo $tiab; }

I don't get the output I expect. Any idea where it might have gone wrong?


1 answer

dongyizhuang0134 2014-06-04 15:10
If you are going to extract text from XML, the best option is to use an XML parser, such as a DOM parser:

$document = new DOMDocument();
$document->load("http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24874852&retmode=xml&rettype=abstract&email=abc@xyz.com");

From there you can use the XPath language to select the data you want to extract: //AbstractText will return a set of all <AbstractText> nodes.

You can use XPath in PHP on your parsed document:

$xpath = new DOMXPath($document);

To get all <AbstractText> nodes, use:

$xpath->evaluate("//AbstractText")

And to extract the text from each node use nodeValue:

foreach ($xpath->evaluate("//AbstractText") as $abstractText) {
    echo $abstractText->nodeValue . " ";
}
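
By contrast, here is a minimal sketch of why a pattern like the one in the question finds nothing. The shortened `$xml` fragment is a hypothetical stand-in for the real response, and the pattern is only modelled on the original. In the data the opening tag carries an `NlmCategory` attribute, so the literal text `<AbstractText>` never occurs. The character class also leaves out characters such as commas and hyphens that appear in the abstract.

```php
<?php
// Hypothetical, shortened stand-in for the efetch response (an assumption for illustration).
$xml = '<Abstract><AbstractText NlmCategory="UNLABELLED">'
     . 'In the present study, we demonstrate that 56-kDa hSP56 inhibits cell growth.'
     . '</AbstractText></Abstract>';

// Pattern modelled on the question: a bare opening tag and a narrow character class.
$pattern = '/<AbstractText>\s*([0-9A-Za-z._\s]+)\s*<\/AbstractText>/s';

// preg_match_all() returns the number of full pattern matches.
$count = preg_match_all($pattern, $xml, $matches);

// The tag in the data is <AbstractText NlmCategory="...">, so the count is zero.
echo "Regex matches: $count\n";
```

A parser is not affected by attributes or by which characters the text contains, which is why it tends to be the more robust choice here.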

See a working example using your data here: http://codepad.viper-7.com/nlryKH
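
Since the goal is to pull data from every level of the structure, the same approach can be extended with relative XPath expressions. The following is only a sketch under stated assumptions: `$xml` is a trimmed-down, hypothetical copy of the response so that it runs offline, and the `$records` array layout is an illustrative choice, not something the API prescribes.

```php
<?php
// Hypothetical, trimmed-down copy of the response (an assumption for illustration).
$xml = <<<XML
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">24874852</PMID>
      <Article>
        <Journal><Title>BMB reports</Title></Journal>
        <AuthorList>
          <Author><LastName>Jeong</LastName><Initials>JY</Initials></Author>
          <Author><LastName>Zhou</LastName><Initials>JR</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pii">2831</ArticleId>
        <ArticleId IdType="pubmed">24874852</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$records = [];
foreach ($xpath->evaluate('//PubmedArticle') as $article) {
    $authors = [];
    // The second argument makes the expression relative to the given node.
    foreach ($xpath->evaluate('.//Author', $article) as $author) {
        // string(...) makes evaluate() return a plain string instead of a node list.
        $authors[] = $xpath->evaluate('string(LastName)', $author) . ' '
                   . $xpath->evaluate('string(Initials)', $author);
    }
    $records[] = [
        'pmid'    => $xpath->evaluate('string(.//PMID)', $article),
        'journal' => $xpath->evaluate('string(.//Journal/Title)', $article),
        // A predicate on an attribute picks one of several sibling elements.
        'pii'     => $xpath->evaluate('string(.//ArticleId[@IdType="pii"])', $article),
        'authors' => $authors,
    ];
}

print_r($records);
```

Each `<PubmedArticle>` becomes one entry in `$records`, which can then be rendered on another page in whatever layout is needed.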
This answer was accepted by the asker.
