Question: How can I reliably extract the MD&A section from 10-K filings?
Task: I have already downloaded the 10-K .htm files of many companies locally and need to extract the Item 7 (Management's Discussion and Analysis) section from each SEC 10-K annual report for text analysis. Using the SEC API is not an option.
Difficulties:
- The HTML tag formats inside the filings are inconsistent; even the same company can use different markup in different years.
- I delimit the section by searching for the Item 7 and Item 7A/8 headings, but the table of contents in the HTML also lists Item 7 and Item 7A/8; I suspect this is why the code below sometimes grabs text from the second page of the 10-K's table of contents.
- The Item 7 heading itself is written in several ways, e.g. Item 7, ITEM 7, ITEM_7, Item 7 (see the sketch right after this list).
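One rough, untested idea for handling both the heading variants and the duplicated TOC headings is to normalize everything into a single case-insensitive pattern and, when several Item 7 / Item 7A spans are found, keep the longest one, since the TOC entry is followed almost immediately by its own Item 7A line. A minimal sketch of that heuristic (the pattern, the names ITEM7_RE, ITEM7A_RE, and extract_mdna are my own, not part of the scripts below):

import re
from bs4 import BeautifulSoup

# Hypothetical helper, not part of the two scripts below: one case-insensitive
# pattern covers "Item 7", "ITEM 7", "ITEM_7", and "Item&nbsp;7" (BeautifulSoup
# turns &nbsp; into a non-breaking space, which \s matches).
ITEM7_RE = re.compile(r'item[\s_]*7(?![A-Za-z0-9])', re.IGNORECASE)
ITEM7A_RE = re.compile(r'item[\s_]*7\s*a(?![a-z0-9])', re.IGNORECASE)

def extract_mdna(html):
    """Return text between the most plausible 'Item 7' / 'Item 7A' pair, else None."""
    text = BeautifulSoup(html, 'lxml').get_text(' ')
    starts = [m.start() for m in ITEM7_RE.finditer(text)]
    ends = [m.start() for m in ITEM7A_RE.finditer(text)]
    spans = []
    for s in starts:
        following = [e for e in ends if e > s]
        if following:
            # Pair each "Item 7" with the first "Item 7A" that follows it
            spans.append((s, min(following)))
    if not spans:
        return None
    # The TOC pair spans only a line or two; the real MD&A is the longest span
    start, end = max(spans, key=lambda p: p[1] - p[0])
    return re.sub(r'\s+', ' ', text[start:end]).strip()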
The code below was adapted from code found in another post:
Method 1: high matching precision, but only a small share of the 10-K files match successfully
import os
import re
from bs4 import BeautifulSoup
import csv
from pathlib import Path
class TenKScraper:
    def __init__(self, section, next_section):
        self.all_section = [str(i) for i in range(1, 16)] + ['1A', '1B', '7A', '9A', '9B']
        section_num = re.findall(r'\d.*\w*', section.upper())[0]
        next_section_num = re.findall(r'\d.*\w*', next_section.upper())[0]
        if section_num not in self.all_section:
            raise ValueError(f'Section: {section_num} is not available, available sections: {self.all_section}')
        if next_section_num not in self.all_section:
            raise ValueError(f'Section: {next_section_num} is not available, available sections: {self.all_section}')
        self.section = 'Item ' + section_num
        self.next_section = 'Item ' + next_section_num
        self.section_upper = 'ITEM ' + section_num
        self.next_section_upper = 'ITEM ' + next_section_num

    def scrape_folder_to_csv(self, folder_path, output_csv):
        # Prepare the CSV file
        with open(output_csv, mode='w', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['Filename', 'Extracted Content'])  # CSV header
            # Iterate over all files in the folder
            for root, _, files in os.walk(folder_path):
                for file in files:
                    if file.endswith('.htm') or file.endswith('.html'):
                        input_path = os.path.join(root, file)
                        print(f"Processing file: {input_path}")
                        content = self.scrape(input_path)
                        if content:
                            writer.writerow([file, content])

    def scrape(self, input_path):
        try:
            with open(input_path, 'rb') as input_file:
                page = input_file.read()
            # Normalize line breaks and HTML space entities (&nbsp;, &#160;), then collapse repeated spaces
            page = page.strip().replace(b'\n', b' ').replace(b'\r', b'').replace(b'&nbsp;', b' ').replace(b'&#160;', b' ')
            while b'  ' in page:
                page = page.replace(b'  ', b' ')
            regexs = [
                # Regular expressions keyed on the Item headings
                bytes(r'(?i)<(?:span|b)[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<(?:span|b)[^>]*>\s*' + re.escape(self.next_section) + r'\.?', encoding='utf-8'),
                bytes(r'(?i)' + re.escape(self.section) + r'\.\s*(.*?)' + re.escape(self.next_section) + r'\.', encoding='utf-8'),
                bytes(r'bold;\">\s*' + self.section + r'\.(.+?)bold;\">\s*' + self.next_section + r'\.', encoding='utf-8'),
                bytes(r'b>\s*' + self.section + r'\.(.+?)b>\s*' + self.next_section + r'\.', encoding='utf-8'),
                bytes(r'' + self.section + r'\.\s*<\/b>(.+?)' + self.next_section + r'\.\s*<\/b>', encoding='utf-8'),
                bytes(r'' + self.section + r'\.\s*[^<>]+\.\s*<\/b(.+?)' + self.next_section + r'\.\s*[^<>]+\.\s*<\/b', encoding='utf-8'),
                bytes(r'b>\s*<font[^>]+>\s*' + self.section + r'\.(.+?)b>\s*<font[^>]+>\s*' + self.next_section + r'\.', encoding='utf-8'),
                bytes(r'' + self.section.upper() + r'\.\s*<\/b>(.+?)' + self.next_section.upper() + r'\.\s*<\/b>', encoding='utf-8'),
                bytes(r'' + self.section + r'\.\s+<\/b>(.+?)' + self.next_section + r'\.\s+<\/b>', encoding='utf-8'),
                bytes(r'' + self.section + r'\.\s*<[^>]+>(.+?)' + self.next_section + r'\.\s*<[^>]+>', encoding='utf-8'),
                bytes(r'' + self.section + r'\.\s*(.+?)' + self.next_section + r'\.\s*', encoding='utf-8'),
                bytes(r'(?i)<div[^>]*>\s*<span[^>]*>\s*' + re.escape(self.section) + r'\.?\s*</span>(.*?)<div[^>]*>\s*<span[^>]*>\s*' + re.escape(self.next_section) + r'\.?\s*</span>', encoding='utf-8'),
                bytes(r'(?i)<div[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<div[^>]*>\s*' + re.escape(self.next_section) + r'\.?\s*', encoding='utf-8'),
                bytes(r'(?i)<span[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<span[^>]*>\s*' + re.escape(self.next_section) + r'\.?\s*', encoding='utf-8'),
                bytes(r'(?i)' + re.escape(self.section) + r'\.\s*(.*?)' + re.escape(self.next_section) + r'\.', encoding='utf-8'),
                # Added: Item heading inside a <p> tag
                bytes(r'(?i)<p[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<\/p>', encoding='utf-8'),
                bytes(r'(?i)<p[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<p[^>]*>\s*' + re.escape(self.next_section) + r'\.?\s*', encoding='utf-8'),
                # Added: upper-case ITEM inside a <span> within a <div>
                bytes(r'(?i)<div[^>]*>\s*<span[^>]*>\s*' + re.escape(self.section_upper) + r'\.?\s*(.*?)<\/span><\/div>', encoding='utf-8'),
                bytes(r'(?i)<div[^>]*>\s*<span[^>]*>\s*' + re.escape(self.section_upper) + r'\.?\s*(.*?)<div[^>]*>\s*<span[^>]*>\s*' + re.escape(self.next_section_upper) + r'\.?\s*', encoding='utf-8'),
                bytes(r'(?i)<p[^>]*>\s*' + re.escape(self.section) + r'\.?\s*(.*?)<\/p>', encoding='utf-8')  # duplicate of the <p> pattern above
            ]
            match = None
            for regex in regexs:
                match = re.search(regex, page, flags=re.IGNORECASE | re.DOTALL)
                if match:
                    break
            if match:
                html_content = match.group(1).decode('utf-8')
                soup = BeautifulSoup(html_content, 'lxml')
                # Collapse redundant whitespace in the extracted text
                content = re.sub(r'\s+', ' ', soup.get_text()).strip()
                return content
            else:
                print(f"No content found between {self.section} and {self.next_section} in {input_path}.")
                return None
        except Exception as e:
            print(f"Error processing {input_path}: {e}")
            return None
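Method 1 has no driver block of its own; it can be run the same way as Method 2's __main__ block further below. A usage sketch (the folder and CSV names here are placeholders of my own):

if __name__ == '__main__':
    # Placeholder paths: point the first argument at the directory of downloaded 10-K .htm files
    scraper = TenKScraper('7', '7A')
    scraper.scrape_folder_to_csv('10k_filings', 'item7_method1.csv')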
Method 2: lower matching precision, but text can be extracted from many more 10-K files
import os
from bs4 import BeautifulSoup
import re
import csv
class TenKScraper:
    def __init__(self, section, next_section):
        self.all_section = [str(i) for i in range(1, 16)] + ['1A', '1B', '7A', '9A', '9B']
        section_num = re.findall(r'\d.*\w*', section.upper())[0]
        next_section_num = re.findall(r'\d.*\w*', next_section.upper())[0]
        if section_num not in self.all_section:
            raise ValueError(f'Section: {section_num} is not available, available sections: {self.all_section}')
        if next_section_num not in self.all_section:
            raise ValueError(f'Section: {next_section_num} is not available, available sections: {self.all_section}')
        self.section = section_num
        self.next_section = next_section_num

    def generate_patterns(self, section):
        # Regex patterns covering the different heading spellings
        patterns = [
            rf'(?i)Item\s*{section}',
            rf'(?i)ITEM\s*{section}',
            rf'(?i)Item_{section}',
            rf'(?i)ITEM_{section}',
            rf'(?i)Item{section}',
            rf'(?i)ITEM{section}'
        ]
        return patterns

    def find_start_tag(self, soup):
        # Find the tag where the target section starts
        start_patterns = self.generate_patterns(self.section)
        for pattern in start_patterns:
            start_tags = soup.find_all(lambda tag: tag.name in ['p', 'span', 'div', 'b', 'strong'] and re.search(pattern, tag.get_text()))
            if start_tags:
                return start_tags[0]
        return None

    def find_end_tag(self, next_tag):
        # Check whether this tag marks the start of the next section
        end_patterns = self.generate_patterns(self.next_section)
        for pattern in end_patterns:
            if next_tag.name in ['p', 'span', 'div', 'b', 'strong'] and re.search(pattern, next_tag.get_text()):
                return True
        return False

    def skip_table_of_contents(self, soup):
        # Skip the table-of-contents page by resuming after the last "Table of Contents" tag
        toc_pattern = r'(?i)Table of Contents'
        toc_tags = soup.find_all(lambda tag: tag.name in ['p', 'span', 'div', 'b', 'strong'] and re.search(toc_pattern, tag.get_text()))
        if toc_tags:
            last_toc_tag = toc_tags[-1]
            next_tag = last_toc_tag.find_next_sibling()
            if next_tag:
                return next_tag
        return soup.find()

    def scrape(self, input_path):
        try:
            with open(input_path, 'r', encoding='utf-8') as input_file:
                html_content = input_file.read()
            soup = BeautifulSoup(html_content, 'lxml')
            # Skip the table-of-contents page before searching for the section heading
            start_search_tag = self.skip_table_of_contents(soup)
            start_tag = self.find_start_tag(BeautifulSoup(str(start_search_tag), 'lxml'))
            if not start_tag:
                return None
            content = []
            next_tag = start_tag.find_next_sibling()
            while next_tag:
                if self.find_end_tag(next_tag):
                    break
                content.append(next_tag.get_text(strip=True))
                next_tag = next_tag.find_next_sibling()
            return ' '.join(content)
        except Exception as e:
            print(f"Error processing {input_path}: {e}")
            return None

    def scrape_folder_to_csv(self, folder_path, output_csv):
        results = []
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                if file.endswith('.htm') or file.endswith('.html'):
                    file_path = os.path.join(root, file)
                    item_content = self.scrape(file_path)
                    if item_content is not None:
                        results.append((file, item_content))
                    else:
                        results.append((file, 'No matching content found'))
        with open(output_csv, mode='w', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['Filename', f'Item {self.section} content'])
            for file, content in results:
                writer.writerow([file, content])

if __name__ == '__main__':
    folder_path = ''
    output_csv = 'test.csv'
    scraper = TenKScraper('7', '7A')
    scraper.scrape_folder_to_csv(folder_path, output_csv)
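Since Method 1 is precise but misses files while Method 2 matches more files less precisely, one option I am considering is to try Method 1 first and fall back to Method 2 per file. A sketch of that idea, assuming the two classes are saved in separate modules (the module names method1 and method2 are hypothetical, since both classes are called TenKScraper):

import csv
import os

from method1 import TenKScraper as PreciseScraper    # hypothetical module names
from method2 import TenKScraper as FallbackScraper

def scrape_with_fallback(folder_path, output_csv):
    precise = PreciseScraper('7', '7A')
    fallback = FallbackScraper('7', '7A')
    with open(output_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Filename', 'Item 7 content', 'Method'])
        for root, _, files in os.walk(folder_path):
            for file in files:
                if not file.endswith(('.htm', '.html')):
                    continue
                path = os.path.join(root, file)
                # Prefer the precise regex-based extraction; fall back to tag walking
                content = precise.scrape(path)
                method = 'regex'
                if not content:
                    content = fallback.scrape(path)
                    method = 'tag-walk'
                writer.writerow([file, content or '', method])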