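First, the node structure is as shown below. Among all descendants of the summaryrecordstable node, I need to get the nodes whose id matches RECORD_[0-9].
The code is as follows: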
import requests
from lxml import etree
url_base='https://apps.webofknowledge.com'
url_test='https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=6C4xx6Mer35kGYw4PU7&page=1&doc=50'
url_head={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.57'
}
session=requests.session()
res=session.get(url_test,headers=url_head)
res_html = etree.HTML(res.text)
# Replacing @href with text() returns nothing, because this <a> tag has only attributes and no text
url_cited_post=res_html.xpath('//a[@title="View all of the articles that cite this one"]/@href')
print(url_cited_post[0])
url_allcited=url_base+url_cited_post[0]
res=session.get(url_allcited,headers=url_head)
res_html = etree.HTML(res.text)
# url_cited_post=res_html.xpath('//div[contains(@id,"RECORD_")]')
url_cited_post=res_html.xpath('//div[contains(@id,"summaryRecordsTable")]//div[contains(@id,"RECORD_")]')
print('end')
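The third line from the end (the one commented out with #) returns results as expected, but once the prefix is added, as in the second line from the end, it no longer returns anything. I can't figure out why.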
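One way to narrow this down is to check which tag actually carries the summaryRecordsTable id in the tree lxml parsed. The sketch below uses made-up markup (SAMPLE_HTML is hypothetical, not the real Web of Science page) and assumes, purely for illustration, that the container is a <table> rather than a <div>. Under that assumption the unprefixed query matches, the //div[...] prefixed query matches nothing, and a wildcard //*[...] prefix matches again. Note also that contains() is case-sensitive, so the id has to be spelled exactly as it appears in the page.

```python
from lxml import etree

# Hypothetical markup, NOT the real page: here the container is a <table>,
# and the RECORD_ divs are its descendants.
SAMPLE_HTML = """
<html><body>
  <table id="summaryRecordsTable">
    <tr><td><div id="RECORD_1">first</div></td></tr>
    <tr><td><div id="RECORD_2">second</div></td></tr>
  </table>
  <div id="RECORD_99">outside the container</div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Unprefixed query: matches every RECORD_ div anywhere in the document.
all_records = tree.xpath('//div[contains(@id,"RECORD_")]')

# Prefixed query from the post: requires the container itself to be a <div>.
div_prefixed = tree.xpath(
    '//div[contains(@id,"summaryRecordsTable")]//div[contains(@id,"RECORD_")]'
)

# Same query with a wildcard container: matches whatever tag carries the id.
any_prefixed = tree.xpath(
    '//*[contains(@id,"summaryRecordsTable")]//div[contains(@id,"RECORD_")]'
)

# Diagnostic: which tag actually carries the container id in the parsed tree?
container_tags = [
    el.tag for el in tree.xpath('//*[contains(@id,"summaryRecordsTable")]')
]

print(len(all_records), len(div_prefixed), len(any_prefixed), container_tags)
```

Running the same container_tags diagnostic against res_html from the code above would show whether the container is a div at all, or whether it is missing from the HTML that requests received.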