2301_76523335 2023-02-24 11:31

Python scraper runs successfully but no data is output

I wrote a scraper to crawl PubMed for article titles and links.
It runs without errors, but it fetches no data: when I tested it, it reported that no articles were found. Yet an older scraper I have used before can pull articles from the same site. Why is that?
This is the code that runs but outputs no data:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}

def get_articles(url):
    # Send the HTTP request and fetch the page
    response = requests.get(url, headers=headers)
    html = response.text

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Find the tags that contain the article information
    article_tags = soup.select('.docsum-content')

    # Extract the title and link of each article
    results = []
    for tag in article_tags:
        title_tags = tag.select('.docsum-title > a')
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))

    return results

if __name__ == '__main__':
    for page in range(1, 6):
        page_url = f'{url}&page={page}'
        articles = get_articles(page_url)
        print(f'Page {page}: {page_url} ({len(articles)} articles found)')
        for article in articles:
            print(article[0])
            print(article[1])
            print('---')
```
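
A quick way to narrow a "no articles found" symptom down is to check whether the request itself succeeds and how many nodes each selector actually matches. This is a minimal debugging sketch, not part of the original post; the selectors are the ones used in the code above:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}

response = requests.get(url, headers=headers)
print('HTTP status:', response.status_code)  # 200 means the request itself is fine

soup = BeautifulSoup(response.text, 'html.parser')
# Count what each selector matches to see where the pipeline goes empty
print('.docsum-content matches:  ', len(soup.select('.docsum-content')))
print('.docsum-title > a matches:', len(soup.select('.docsum-title > a')))
print('any a inside a result:    ', len(soup.select('.docsum-content a')))
```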
And this is the code that does work:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023"
num_pages = 10

data = []

for i in range(num_pages):
    # Construct the URL for the current page
    page_url = f"{url}&page={i+1}"
    
    # Make a request to the page and parse the HTML using Beautiful Soup
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the articles on the current page
    articles = soup.find_all("div", class_="docsum-content")
    
    # Extract the title and link for each article and append to the data list
    for article in articles:
        title = article.find("a", class_="docsum-title").text.strip()
        link = article.find("a", class_="docsum-title")["href"]
        data.append([title, link])

df = pd.DataFrame(data, columns=["Title", "Link"])
df.to_excel("cervical_cancer_treatment.xlsx", index=False)
```
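
Two small notes on the working version: `df.to_excel` writes an `.xlsx` file, which requires the openpyxl package to be installed, and the loop fires ten requests back to back without a User-Agent header. A hedged variant of the same loop (my own sketch, not from the thread) that adds the header used earlier, pauses between pages, and stores absolute links the way the first script does:

```python
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50"
}
data = []

for i in range(10):
    # Same page construction as above, but with the User-Agent header attached
    response = requests.get(f"{url}&page={i + 1}", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    for article in soup.find_all("div", class_="docsum-content"):
        a = article.find("a", class_="docsum-title")
        if a:  # skip malformed entries instead of raising AttributeError
            # Prepend the base URL so the stored link is absolute
            data.append([a.text.strip(), "https://pubmed.ncbi.nlm.nih.gov" + a["href"]])
    time.sleep(1)  # be polite: pause between page requests

pd.DataFrame(data, columns=["Title", "Link"]).to_excel(
    "cervical_cancer_treatment.xlsx", index=False  # writing .xlsx needs openpyxl
)
```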


8 answers

  • 程序猿_Mr. Guo 2023-02-24 12:23


    The a-tag selector is wrong. It should be `title_tags = tag.select('a')`, which selects each a tag directly: since `article_tags = soup.select('.docsum-content')` has already narrowed things down to the individual result div, you only need to iterate over the a tags inside it.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}


def get_articles(url):
    # Send the HTTP request and fetch the page
    response = requests.get(url, headers=headers)
    html = response.text

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Find the tags that contain the article information
    article_tags = soup.select('.docsum-content')

    # Extract the title and link of each article
    results = []
    for tag in article_tags:
        title_tags = tag.select('a')  # the fix: select the a tags directly
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))

    return results


if __name__ == '__main__':
    for page in range(1, 6):
        page_url = f'{url}&page={page}'
        articles = get_articles(page_url)
        print(f'Page {page}: {page_url} ({len(articles)} articles found)')
        for article in articles:
            print(article[0])
            print(article[1])
            print('---')
```
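
A side note on the selector itself (my reading of PubMed's result markup, not stated in the answer): the `docsum-title` class appears to sit on the `<a>` element itself rather than on a parent, which is why the child selector `.docsum-title > a` matches nothing. If you want something more precise than selecting every `a` tag, this should also work:

```python
# Assumes PubMed renders each result title as <a class="docsum-title" href="/PMID/">,
# i.e. the class is on the anchor itself, so select it directly:
title_tags = tag.select('a.docsum-title')  # instead of '.docsum-title > a'
```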
    
    This answer was accepted by the asker as the best answer.


Question events

  • Question closed by the system on Mar 4
  • Answer accepted on Feb 24
  • Question created on Feb 24
