2301_76523335 2023-02-24 11:31 · acceptance rate: 100%
112 views
Closed

Python scraper runs successfully but outputs no data

I wrote a scraper to collect article titles and links from PubMed.
It runs without errors, but it fetches no data; when I tested it, it reported that no articles were found. Yet an older scraper of mine can extract articles from the same site. Why is that?
This is the code that runs but produces no output:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
}

def get_articles(url):
    # Send the HTTP request and fetch the page
    response = requests.get(url, headers=headers)
    html = response.text

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Locate the tags containing the article information
    article_tags = soup.select('.docsum-content')

    # Extract each article's title and link
    results = []
    for tag in article_tags:
        title_tags = tag.select('.docsum-title > a')
        if title_tags:
            title = title_tags[0].get_text().strip()
            link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
            results.append((title, link))

    return results

if __name__ == '__main__':
    for page in range(1, 6):
        page_url = f'{url}&page={page}'
        articles = get_articles(page_url)
        print(f'Page {page}: {page_url} ({len(articles)} articles found)')
        for article in articles:
            print(article[0])
            print(article[1])
            print('---')
```
This is the code that does work:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023"
num_pages = 10

data = []

for i in range(num_pages):
    # Construct the URL for the current page
    page_url = f"{url}&page={i+1}"
    
    # Make a request to the page and parse the HTML using Beautiful Soup
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the articles on the current page
    articles = soup.find_all("div", class_="docsum-content")
    
    # Extract the title and link for each article and append to the data list
    for article in articles:
        title = article.find("a", class_="docsum-title").text.strip()
        link = article.find("a", class_="docsum-title")["href"]
        data.append([title, link])

df = pd.DataFrame(data, columns=["Title", "Link"])
df.to_excel("cervical_cancer_treatment.xlsx", index=False)


```
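As a side note, if pandas and openpyxl are not available, the same `data` list can be written out with the standard-library `csv` module instead. This is a sketch, not part of the original answer; the row below is a hypothetical stand-in for scraped results:

```python
import csv

# Hypothetical scraped rows, same shape as the `data` list built above.
data = [
    ["Example title", "https://pubmed.ncbi.nlm.nih.gov/36000000/"],
]

# Write a header row plus the data rows; no third-party packages needed.
with open("cervical_cancer_treatment.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Link"])
    writer.writerows(data)
```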


8 answers

  • 程序猿_Mr. Guo 2023-02-24 12:23


    The `a`-tag selector is wrong; it should be `title_tags = tag.select('a')`, which selects every `<a>` tag. Since `article_tags = soup.select('.docsum-content')` has already located the specific div, you only need to iterate over the `<a>` tags inside it.

    ```python
    import requests
    from bs4 import BeautifulSoup

    url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cervical%20cancer%20treatment&filter=years.2020-2023'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
    }


    def get_articles(url):
        # Send the HTTP request and fetch the page
        response = requests.get(url, headers=headers)
        html = response.text

        # Parse the HTML
        soup = BeautifulSoup(html, 'html.parser')

        # Locate the tags containing the article information
        article_tags = soup.select('.docsum-content')
        # print('article_tags : ', article_tags)

        # Extract each article's title and link
        results = []
        for tag in article_tags:
            title_tags = tag.select('a')
            if title_tags:
                title = title_tags[0].get_text().strip()
                link = 'https://pubmed.ncbi.nlm.nih.gov' + title_tags[0]['href']
                results.append((title, link))

        return results


    if __name__ == '__main__':
        for page in range(1, 6):
            page_url = f'{url}&page={page}'
            articles = get_articles(page_url)
            print(f'Page {page}: {page_url} ({len(articles)} articles found)')
            for article in articles:
                print(article[0])
                print(article[1])
                print('---')
    ```
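    The selector difference can be reproduced on a static snippet. This is a minimal sketch; the markup below assumes, as on PubMed's result pages, that the `<a>` element itself carries the `docsum-title` class, so the child combinator `.docsum-title > a` matches nothing:

    ```python
    from bs4 import BeautifulSoup

    # Trimmed-down stand-in for one PubMed search result (assumed markup:
    # the <a> element itself has class "docsum-title").
    html = '''
    <div class="docsum-content">
      <a class="docsum-title" href="/36000000/">Example article title</a>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select('.docsum-content')[0]

    # '.docsum-title > a' looks for an <a> *inside* an element with class
    # docsum-title; here the <a> IS that element, so nothing matches:
    print(tag.select('.docsum-title > a'))   # []

    # Selecting the <a> directly (or by its own class) finds it:
    print(tag.select('a.docsum-title')[0].get_text().strip())
    ```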
    
    This answer was accepted by the asker as the best answer.

Question events

  • Closed by the system on March 4
  • Answer accepted on February 24
  • Question created on February 24
