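First, the node structure is as follows. Among all descendant nodes of the summaryRecordsTable node, I need to get the nodes whose id attribute has the form RECORD_[0-9].
The code is as follows: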
- import requests
- from lxml import etree
-
- url_base='https://apps.webofknowledge.com'
- url_test='https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=6C4xx6Mer35kGYw4PU7&page=1&doc=50'
- url_head={
- 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.57'
- }
- session=requests.session()
- res=session.get(url_test,headers=url_head)
- res_html = etree.HTML(res.text)
- url_cited_post=res_html.xpath('//a[@title="View all of the articles that cite this one"]/@href')# Replacing @href with text() returns nothing, because this a tag has only attributes and no text
- print(url_cited_post[0])
- url_allcited=url_base+url_cited_post[0]
- res=session.get(url_allcited,headers=url_head)
- res_html = etree.HTML(res.text)
- # url_cited_post=res_html.xpath('//div[contains(@id,"RECORD_")]')
- url_cited_post=res_html.xpath('//div[contains(@id,"summaryRecordsTable")]//div[contains(@id,"RECORD_")]')
- print('end')
-
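The third line from the end, the one commented out with a hash sign, returns results normally. But once the prefix is added, as in the second line from the end, it no longer returns anything. I don't know what is going on.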
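
One way to narrow this down is sketched below. It runs against `SAMPLE_HTML`, a hypothetical stand-in for the real page, and `record_ids` and `ancestor_chain` are illustrative helper names, not part of the original code. When the `RECORD_` divs really are nested inside a div whose id contains `summaryRecordsTable`, the prefixed XPath does match. So a likely explanation (an assumption, since the real markup is not shown) is that on the actual page the records are not descendants of such a div, for example because the container is a different tag or the parser moved the nodes while repairing the HTML. Printing each record's ancestors shows where it actually sits. The sketch also uses lxml's EXSLT regular-expression support to match `RECORD_` followed by digits, since XPath 1.0 itself has no regex.

```python
from lxml import etree

# Hypothetical markup standing in for the real page (its structure is assumed).
SAMPLE_HTML = """
<html><body>
  <div id="summaryRecordsTable">
    <div class="search-results">
      <div id="RECORD_1">first</div>
      <div id="RECORD_2">second</div>
    </div>
  </div>
  <div id="RECORD_99">outside the table</div>
</body></html>
"""

# Namespace that enables re:test() in lxml XPath expressions.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}


def record_ids(tree):
    """Ids of RECORD_<digits> divs nested anywhere under summaryRecordsTable."""
    nodes = tree.xpath(
        '//div[contains(@id,"summaryRecordsTable")]'
        '//div[re:test(@id, "^RECORD_[0-9]+$")]',
        namespaces=EXSLT_RE,
    )
    return [node.get("id") for node in nodes]


def ancestor_chain(node):
    """(tag, id) of every ancestor, innermost first, to see where a node sits."""
    return [(anc.tag, anc.get("id")) for anc in node.iterancestors()]


tree = etree.HTML(SAMPLE_HTML)
print(record_ids(tree))  # ['RECORD_1', 'RECORD_2']

# Diagnostic: list the ancestors of every RECORD_ div found without the prefix.
for node in tree.xpath('//div[contains(@id,"RECORD_")]'):
    print(node.get("id"), ancestor_chain(node))
```

Running the same loop on the tree built from `res.text` would show whether any ancestor of the matched divs is a `div` with `summaryRecordsTable` in its id. If none is, the prefix in the XPath needs to name whatever element actually wraps the records.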