# First attempt -- formatting cleaned up.
# --- Script 1: box-office page scraper (piaofang168.com) ---
import urllib.request
import re
def into(url):
    """Fetch *url* and return the response body decoded as UTF-8.

    The body is also echoed to stdout, matching the original behaviour.
    Raises urllib.error.URLError on network failure and
    UnicodeDecodeError if the page is not valid UTF-8.
    """
    # Context manager guarantees the HTTP connection is closed even if
    # read()/decode() raises (the original leaked the response object).
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')
    print(html)
    return html
def find(url):
    """Download *url* via into() and append every regex match to a.txt.

    Each match is printed to stdout and appended to the file, matching
    the original behaviour.
    """
    html = into(url)
    # NOTE(review): '(.*?)' is a bare non-greedy group with nothing around
    # it, so it matches the empty string at every position -- this looks
    # like a placeholder pattern that still needs the real markup filled
    # in.  Kept as-is to preserve behaviour.
    pattern = re.compile('(.*?)', re.S)
    items = re.findall(pattern, html)
    # Open the output file once (the original re-opened and re-closed it
    # for every single match) and let the context manager close it.
    with open("a.txt", "a") as out:
        for item in items:
            print(item)
            out.write(item)
# Target page scraped when this file is run as a script.
url = "http://www.piaofang168.com/"

if __name__ == '__main__':
    find(url)
# --- Script 2: codeforge.cn listing crawler ---
#!/usr/bin/env python
#coding:utf-8
import urllib.request
from bs4 import BeautifulSoup
def parse_list(url):
    """Fetch one listing page and hand each entry's detail URL to parse_data().

    Entries whose markup lacks the expected h3 > a structure are reported
    to stdout and skipped.  Reads the module-level `verbose` flag and the
    `data_base_url` prefix; detail links on the page are site-relative.
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib.request.Request(url, headers=headers)
    # Context manager closes the HTTP connection (the original never
    # closed the response object).
    with urllib.request.urlopen(req, timeout=60) as page:
        contents = page.read()
    soup = BeautifulSoup(contents, "lxml")
    for tag in soup.find_all('div', class_='content-list'):
        try:
            data_url = tag.h3.a.attrs['href']
        except AttributeError:
            print("error at:", tag.get_text())
        else:
            if verbose:
                print(data_url)
            parse_data(data_base_url + data_url)
def parse_data(url):
    """Scrape title and read/download counters from one detail page.

    On success, writes one CSV row via write_data().  Pages that are not
    valid UTF-8, or whose markup lacks the expected structure, are
    reported to stdout and skipped.  Reads the module-level `verbose`
    flag for optional debug output.
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib.request.Request(url, headers=headers)
    # Context manager closes the HTTP connection (the original never
    # closed the response object).
    with urllib.request.urlopen(req, timeout=60) as page:
        raw = page.read()
    try:
        # Only the decode can raise UnicodeDecodeError, so keep the try
        # body minimal.
        contents = raw.decode('UTF-8')
    except UnicodeDecodeError:
        print("UnicodeDecodeError: " + url)
    else:
        soup = BeautifulSoup(contents, "lxml")
        try:
            tag = soup.find('div', id='homepost')
            title = tag.find('div', class_='toptit').h2.get_text()
            if verbose:
                print(title)
            trs_left = tag.find('table', class_="infotable").find_all('tr')
            if verbose:
                print(trs_left)
            # Rows 1-3 of the info table hold the three counters we want.
            read_num = trs_left[1].td.span.get_text()
            download_num = trs_left[2].td.span.get_text()
            download_points = trs_left[3].td.span.get_text()
        except AttributeError:
            # Any missing element in the chain above lands here.
            print("error at:", url)
        else:
            write_data(title, read_num, download_num, download_points, url)
def write_data(title, read_num, download_num, download_points, url):
    """Append one comma-separated record to the module-level output file `f`."""
    row = ",".join((title, read_num, download_num, download_points, url))
    f.write(row + "\n")
# Listing pages are base_url + page number; detail links found on them are
# relative to data_base_url.
base_url = 'http://www.codeforge.cn/l/0/c/0/t/0/v/0/p/'
data_base_url = 'http://www.codeforge.cn'
# Output handle shared by write_data().
# NOTE(review): opened as a side effect of import and with no explicit
# encoding; consider opening it inside the __main__ guard with `with`.
f = open('data.csv', 'w')
# When True, parse_list/parse_data print extra progress output.
verbose = False
if __name__ == '__main__':
    # CSV header row.
    f.write("title, read_num, download_num, download_points, url \n")
    try:
        # Crawl listing pages 0..999; the progress message suggests each
        # listing page holds about 10 entries -- TODO confirm.
        for i in range(1000):
            parse_list(base_url + str(i))
            # Flush after every page so partial results survive a crash.
            f.flush()
            print("has finish %s" % str((i+1)*10))
    finally:
        # The original never closed the output file; ensure it is closed
        # even if a page raises.
        f.close()