Matoi_R 2017-04-29 13:36 采纳率: 100%
浏览 965
已采纳

求帮下新手。。有关PYTHON3的基础爬虫类问题

import urllib.request
import os
import re

def into(url):
    """Fetch *url* and return the page body decoded as UTF-8.

    Also prints the HTML so the caller can eyeball the download.
    """
    # The original shadowed the ``url`` parameter with a hard-coded
    # address, which made the argument useless; removed so the function
    # actually fetches what it is given.
    # ``with`` closes the connection even if read/decode raises.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')

    print(html)
    return html

def find(url):
    """Download *url*, extract every regex match and append them to a.txt."""
    # ``into`` already returns the decoded HTML string; the original
    # ``into(url).html`` would raise AttributeError on a str.
    html = into(url)
    # NOTE(review): the pattern looks truncated by the paste — '(.*?)'
    # alone matches empty strings between every character; the real
    # pattern presumably had literal tags around the group. Confirm
    # against the target page before relying on the output.
    pattern = re.compile('(.*?)', re.S)  # re.S: '.' also matches newlines
    items = pattern.findall(html)
    # Open the output file once (the original reopened it per item and
    # passed the wrong names to re.findall).
    with open("a.txt", "a") as out:
        for item in items:
            print(item)
            out.write(item)

就是它为什么连html都打印不出来,一开始是可以的,就是我用了DEF将它们包装后就运行不了了。。。。

  • 写回答

3条回答 默认 最新

  • N4A 2017-04-29 15:57
    关注

    第一次用,改一下排版
    1

    import urllib.request
    import re
    
    
    def into(url):
        """Fetch *url* and return its HTML decoded as UTF-8.

        Prints the page as a side effect so progress is visible.
        """
        # Use the response as a context manager so the socket is closed
        # deterministically instead of being leaked until GC.
        with urllib.request.urlopen(url) as response:
            html = response.read().decode('utf-8')

        print(html)
        return html
    
    
    def find(url):
        """Scrape *url* and append every regex match to a.txt."""
        html = into(url)
        # NOTE(review): '(.*?)' with no surrounding literals matches empty
        # strings at every position; the intended pattern probably had
        # HTML tags around the group — confirm against the target page.
        findit = re.compile('(.*?)', re.S)  # re.S: '.' matches newlines too
        items = findit.findall(html)
        # Open the file once outside the loop; the original reopened and
        # closed it for every single match.
        with open("a.txt", "a") as out:
            for item in items:
                print(item)
                out.write(item)
    
    # Target site to crawl; defined at module level so importing the file
    # does not trigger a network request — only running it does.
    url = "http://www.piaofang168.com/"
    if __name__ == '__main__':
        find(url)
    
    

    2

    #!/usr/bin/env python
    #coding:utf-8
    import urllib.request
    from bs4 import BeautifulSoup
    
    
    def parse_list(url):
        """Fetch one listing page and crawl every detail link it contains.

        Relies on module-level ``verbose``, ``data_base_url`` and the
        sibling ``parse_data`` function.
        """
        # A browser User-Agent: some sites reject the default Python one.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib.request.Request(url, headers=headers)
        # Context manager closes the connection; the original leaked it.
        with urllib.request.urlopen(req, timeout=60) as page:
            contents = page.read()
        soup = BeautifulSoup(contents, "lxml")
        for tag in soup.find_all('div', class_='content-list'):
            try:
                data_url = tag.h3.a.attrs['href']
            except AttributeError:
                # Entry without the expected <h3><a href> — report and skip.
                print("error at:", tag.get_text())
            else:
                if verbose:
                    print(data_url)
                parse_data(data_base_url+data_url)
    
    
    def parse_data(url):
        """Fetch one detail page, extract its stats and write a CSV row.

        Extracts title, read count, download count and download points from
        the ``#homepost`` section; any missing element is reported and the
        page is skipped. Delegates output to ``write_data``.
        """
        # Same browser User-Agent as parse_list to avoid being blocked.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib.request.Request(url, headers=headers)
        # Context manager closes the connection; the original leaked it.
        with urllib.request.urlopen(req, timeout=60) as page:
            raw = page.read()
        try:
            contents = raw.decode('UTF-8')
        except UnicodeDecodeError:
            # Page is not valid UTF-8 — log and skip rather than crash.
            print("UnicodeDecodeError: " + url)
        else:
            soup = BeautifulSoup(contents, "lxml")
            try:
                tag = soup.find('div', id='homepost')
                title = tag.find('div', class_='toptit').h2.get_text()
                if verbose:
                    print(title)
                # The left info table: rows 1-3 hold the three counters.
                trs_left = tag.find('table', class_="infotable").find_all('tr')
                if verbose:
                    print(trs_left)
                read_num = trs_left[1].td.span.get_text()
                download_num = trs_left[2].td.span.get_text()
                download_points = trs_left[3].td.span.get_text()
            except AttributeError:
                # Any missing element (find() returned None) lands here.
                print("error at:", url)
            else:
                write_data(title, read_num, download_num, download_points, url)
    
    
    def write_data(title, read_num, download_num, download_points, url):
        """Append one comma-separated row to the module-level file ``f``."""
        row = ",".join((title, read_num, download_num, download_points, url))
        f.write(row + "\n")
    
    
    # Listing-page URL prefix; the page number is appended in the loop below.
    base_url = 'http://www.codeforge.cn/l/0/c/0/t/0/v/0/p/'
    # Prefix for the relative detail links found on each listing page.
    data_base_url = 'http://www.codeforge.cn'
    # NOTE(review): opened at import time and never closed; rows are flushed
    # after every listing page below, so data survives an abort.
    f = open('data.csv', 'w')
    verbose = False
    if __name__ == '__main__':
        f.write("title, read_num, download_num, download_points, url \n")
        for i in range(1000):
            parse_list(base_url + str(i))
            f.flush()
            # presumably 10 entries per listing page, hence the *10 — confirm
            print("has finish %s" % str((i+1)*10))
    
    
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥20 有关区间dp的问题求解
  • ¥15 多电路系统共用电源的串扰问题
  • ¥15 slam rangenet++配置
  • ¥15 有没有研究水声通信方面的帮我改俩matlab代码
  • ¥15 对于相关问题的求解与代码
  • ¥15 ubuntu子系统密码忘记
  • ¥15 信号傅里叶变换在matlab上遇到的小问题请求帮助
  • ¥15 保护模式-系统加载-段寄存器
  • ¥15 电脑桌面设定一个区域禁止鼠标操作
  • ¥15 求NPF226060磁芯的详细资料