zesi111
2020-10-21 11:53
采纳率: 100%
浏览 266
已采纳

python实现多个网页链接的表格循环获取功能

问题描述:实现多个网页的表格获取,并保存为excel表,excel表分页sheet分别以XX
命名,对应于上面的获取表格的名称。
已实现功能:单个页面表格的读取和保存;
未实现功能:循环网页,sheet改名。
请大神帮忙改一下代码,看看如何实现。
提供的两个网址如下:
1.https://www.marketbeat.com/stocks/NYSE/BILL/institutional-ownership/
2.https://www.marketbeat.com/stocks/NYSE/SPG/institutional-ownership/
其中,BIll和SPG是可以做成循环网址读取的,表格sheet命名也是与此对应的。
目前已实现代码如下:

import requests
from bs4 import BeautifulSoup
import xlwt

# 请求headers 模拟谷歌浏览器访问
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}


def get_data():
    response = requests.get('https://www.marketbeat.com/stocks/NYSE/BILL/institutional-ownership/', headers=headers)
    bs = BeautifulSoup(response.text, 'lxml')

    # 标题处理
    title = bs.find_all('th')
    data_list_title = []  # 定义一个空列表
    for data in title:
        data_list_title.append(data.text.strip())  # 获取标签的内容去掉两边空格并添加到列表里

    # 内容处理
    content = bs.find_all('td')
    data_list_content = []  # 定义一个空列表
    for data in content:
        data_list_content.append(data.text.strip())  # 获取标签的内容去掉两边空格并添加到列表里
    # 语句featList = [example[i] for example in dataSet]作用为: 将dataSet中的数据按行依次放入example中,然后取得example中的example[i]元素,放入列表featList中
    del(data_list_content[80])
    print(*data_list_content,sep='\n')

    new_list = [data_list_content[i:i+7] for i in range(0, len(data_list_content)-17,8)]

    # 存入excel表格
    book = xlwt.Workbook()
    sheet1 = book.add_sheet('sheet1', cell_overwrite_ok=True)

    # 标题存入
    heads = data_list_title[:]  # 将data_list_title第一位到最后一位赋值给heads
    ii = 0
    for head in heads:
        sheet1.write(0, ii, head)
        ii += 1

    # 内容录入
    i = 1
    for list in new_list:
        j = 0
        for data in list:
            sheet1.write(i, j, data)
            j += 1
        i += 1
    # 文件保存
    book.save('./data.xls')


print("全部完成")

# 调用
get_data()

已实现表格数据如下
Reporting Date Hedge Fund Shares Held Market Value % of Portfolio Quarterly Change in Shares Ownership in Company

10/9/2020 Envestnet Asset Management Inc. 2,174 $0.22M 0.0% N/A 0.003%

10/6/2020 Avitas Wealth Management LLC 5,410 $0.54M 0.2% +57.2% 0.007%

9/28/2020 Manchester Capital Management LLC 1,500 $0.14M 0.0% +50.0% 0.002%

9/22/2020 Atria Investments LLC 31,463 $2.84M 0.1% N/A 0.039%

9/15/2020 Two Sigma Advisers LP 5,800 $0.52M 0.0% N/A 0.007%

9/15/2020 Schonfeld Strategic Advisors LLC 62,798 $5.67M 0.1% N/A 0.078%

9/4/2020 Principal Financial Group Inc. 2,390 $0.22M 0.0% N/A 0.003%

8/27/2020 Neuberger Berman Group LLC 34,087 $3.08M 0.0% +88.1% 0.047%

8/25/2020 Nuveen Asset Management LLC 96,953 $8.75M 0.0% +192.5% 0.134%

8/20/2020 Charles Schwab Investment Management Inc. 95,013 $8.57M 0.0% N/A 0.131%

8/18/2020 Blackstone Group Inc 172,686 $15.58M 0.1% N/A 0.238%

8/17/2020 Engineers Gate Manager LP 5,927 $0.54M 0.0% N/A 0.008%

8/17/2020 California State Teachers Retirement System 27,675 $2.50M 0.0% +66.5% 0.038%

8/17/2020 Townsquare Capital LLC 40,538 $3.37M 0.2% N/A 0.056%

8/17/2020 Great West Life Assurance Co. Can 1,031 $93K 0.0% N/A 0.001%

8/17/2020 Public Employees Retirement System of Ohio 6,024 $0.54M 0.0% +49.0% 0.008%

8/17/2020 Sei Investments Co. 369,166 $33.30M 0.1% +3,354.7% 0.509%

8/17/2020 Capital Impact Advisors LLC 28,000 $2.53M 0.8% N/A 0.039%

8/17/2020 Private Advisor Group LLC 5,450 $0.49M 0.0% N/A 0.008%

  • 写回答
  • 好问题 提建议
  • 追加酬金
  • 关注问题
  • 邀请回答

1条回答 默认 最新

  • PythonJavaC++go 2020-10-21 12:35
    最佳回答
    import requests
    from bs4 import BeautifulSoup
    import xlwt
    
    # 请求headers 模拟谷歌浏览器访问
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    urls_dict = {
    " BILL" : " https://www.marketbeat.com/stocks/NYSE/BILL/institutional-ownership/"
    "SPG": "https://www.marketbeat.com/stocks/NYSE/SPG/institutional-ownership/"
    
    def get_data(url, filename):
        response = requests.get(url, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
    
        # 标题处理
        title = bs.find_all('th')
        data_list_title = []  # 定义一个空列表
        for data in title:
            data_list_title.append(data.text.strip())  # 获取标签的内容去掉两边空格并添加到列表里
    
        # 内容处理
        content = bs.find_all('td')
        data_list_content = []  # 定义一个空列表
        for data in content:
            data_list_content.append(data.text.strip())  # 获取标签的内容去掉两边空格并添加到列表里
        # 语句featList = [example[i] for example in dataSet]作用为: 将dataSet中的数据按行依次放入example中,然后取得example中的example[i]元素,放入列表featList中
        del(data_list_content[80])
        print(*data_list_content,sep='\n')
    
        new_list = [data_list_content[i:i+7] for i in range(0, len(data_list_content)-17,8)]
    
        # 存入excel表格
        book = xlwt.Workbook()
        sheet1 = book.add_sheet('sheet1', cell_overwrite_ok=True)
    
        # 标题存入
        heads = data_list_title[:]  # 将data_list_title第一位到最后一位赋值给heads
        ii = 0
        for head in heads:
            sheet1.write(0, ii, head)
            ii += 1
    
        # 内容录入
        i = 1
        for list in new_list:
            j = 0
            for data in list:
                sheet1.write(i, j, data)
                j += 1
            i += 1
        # 文件保存
        book.save(filename + '.xls')
    
    
    
    
    # 调用
    for filename, url in urls_dict.items():
        get_data(url, filename)
    print("全部完成")
    
    评论
    解决 无用
    打赏 举报

相关推荐 更多相似问题