u010636378 2022-09-17 21:20

scrapy.FormRequest in Python keeps returning a 400 error response

Problem description and background

While learning web scraping recently I have been writing a Biquge novel crawler with the Scrapy framework. Because the request has to carry parameters, I overrode the default first-request method (start_requests), but no matter how I test it the response is always a 400 error. If I leave 'Content-Length' out of the headers, the request goes through, but an error page is returned.
Previously, without Scrapy, calling requests.post with the same URL, headers and data returned the correct content.


Here is the code using scrapy.FormRequest, which reports a 400 error when run:

# imports assumed at the top of the spider module
import scrapy
from urllib.parse import quote

def start_requests(self):   # start_urls entries get GET requests by default; to send a POST, the parent's start_requests must be overridden
    search_name = input('Enter a keyword to search for a novel: ')
    search_name1 = quote(search_name, 'utf-8')   # URL-encoded keyword (not used below)
    data = {'m': 'search', 'key': search_name}
    start_urls = ['http://www.biqugse.com/case.php']
    global header
    header = {
        'Cookie': 'obj=1; 796ab53acf966fbacf8f078ecd10a9ce=a%3A1%3A%7Bi%3A552%3Bs%3A29%3A%2234369962%7C%2A%7C%E7%BB%88%E7%AB%A0%E3%80%81%E6%96%B0%E4%B8%96%E7%95%8C%22%3B%7D; PHPSESSID=ibjjb23leokjq11k2f24q4rqv7; ac30dd80c4d7d9d53b73bdd8bb9aaf43=1; Hm_lvt_7a41ef5a4df2b47849f9945ac428a3df=1663060001,1663069368,1663115792,1663392512; Hm_lpvt_7a41ef5a4df2b47849f9945ac428a3df=1663404614',
        'Content-Length': '31',
        # 'Transfer-Encoding': 'chunked',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    }

    for url in start_urls:
        yield scrapy.FormRequest(url=url, headers=header, formdata=data, callback=self.parse)
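
For comparison, here is a minimal sketch of the same POST with the hardcoded 'Content-Length' simply omitted, letting Scrapy derive the header from the url-encoded formdata itself; the spider name, the reduced header set, and the cookies-as-dict form are assumptions, not a confirmed fix for the error page described above:

import scrapy


class BiquSearchSpider(scrapy.Spider):
    name = 'biqu_search'   # hypothetical spider name

    def start_requests(self):
        keyword = input('Enter a keyword to search for a novel: ')
        data = {'m': 'search', 'key': keyword}
        yield scrapy.FormRequest(
            url='http://www.biqugse.com/case.php',
            formdata=data,   # FormRequest url-encodes this into the request body
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                                   'Chrome/105.0.0.0 Safari/537.36'},
            cookies={'PHPSESSID': 'ibjjb23leokjq11k2f24q4rqv7'},   # value taken from the Cookie header in the post
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('status=%s, body length=%s', response.status, len(response.text))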

Here is the original requests.post version, which returns the page content correctly:

import requests   # assumed at the top of the module

def search_notes():   # search for a novel by name and print the search results
    name = input('Enter the novel name: ')
    url = 'http://www.biqugse.com/case.php'
    header = {
        'Content-Length': '31',
        'Cookie': 'obj=1; 796ab53acf966fbacf8f078ecd10a9ce=a%3A1%3A%7Bi%3A552%3Bs%3A29%3A%2234369962%7C%2A%7C%E7%BB%88%E7%AB%A0%E3%80%81%E6%96%B0%E4%B8%96%E7%95%8C%22%3B%7D; PHPSESSID=ibjjb23leokjq11k2f24q4rqv7; ac30dd80c4d7d9d53b73bdd8bb9aaf43=1; Hm_lvt_7a41ef5a4df2b47849f9945ac428a3df=1663060001,1663069368,1663115792,1663392512; Hm_lpvt_7a41ef5a4df2b47849f9945ac428a3df=1663404614',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    }
    data = {'m': 'search', 'key': name}
    req = requests.post(url, headers=header, data=data)
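
As an aside (not part of the original post): the url-encoded form body changes length with the keyword, so a fixed 'Content-Length: 31' can only match some inputs; requests recomputes that header from the actual body when preparing the request, which may be why the same header dict still works there. A quick check with hypothetical keywords:

from urllib.parse import urlencode

# The encoded body length depends on the keyword, so a hardcoded
# Content-Length of '31' only matches particular inputs.
for keyword in ('abc', '完美世界'):
    body = urlencode({'m': 'search', 'key': keyword})
    print(keyword, '->', body, 'length:', len(body))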

Run results and error output

Output with 'Content-Length' left in the headers:
2022-09-17 21:15:41 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://www.biqugse.com/case.php> (referer: None)
2022-09-17 21:15:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.biqugse.com/case.php>: HTTP status code is not handled or not allowed
2022-09-17 21:15:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-17 21:15:41 [scrapy.core.engine] ERROR: Scraper close failure
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\twisted\internet\defer.py", line 891, in _runCallbacks
    current.result = callback( # type: ignore[misc]
  File "E:\learncode\code\daima\biqu\biqu\pipelines.py", line 29, in close_spider
    jar_url=os.path.join(r'E:\learncode\code\daima\biqu',file_name)
NameError: name 'file_name' is not defined
2022-09-17 21:15:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 401,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 296,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'elapsed_time_seconds': 3.882993,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 9, 17, 13, 15, 41, 331610),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/400': 1,
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 9, 17, 13, 15, 37, 448617)}
2022-09-17 21:15:41 [scrapy.core.engine] INFO: Spider closed (finished)
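
Separately from the 400, the traceback above shows a NameError in pipelines.py: file_name is only referenced in close_spider, never assigned, presumably because no item reached the pipeline. A minimal sketch of a guard, with the pipeline class name and attribute handling assumed:

import os


class BiquPipeline:          # hypothetical pipeline class name
    file_name = None         # set once an item has actually been processed

    def close_spider(self, spider):
        # If no item was scraped (e.g. the request failed with 400),
        # file_name was never assigned, which is what raises the NameError above.
        if not self.file_name:
            spider.logger.warning('no items scraped, nothing to move')
            return
        jar_url = os.path.join(r'E:\learncode\code\daima\biqu', self.file_name)
        spider.logger.info('output file: %s', jar_url)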
Page returned with 'Content-Length' commented out (an error page saying "please refresh and search again", with a 2-second auto-redirect):

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0"/>
<meta http-equiv="Cache-Control" content="max-age=0"/>
<meta http-equiv="Cache-Control" content="no-transform "/>
<style type="text/css">
body{font-size:13px;}
</style>
</head>
<body>
<script type="text/javascript">
  var sec = 2;
  var t = setInterval(function(){
    sec = sec - 1;    
    if(sec < 1){
        clearInterval(t);
        return;
    }
    document.getElementById('seconds').innerHTML = sec;
  },1000);
  setTimeout(function(){
    window.location.href = "";
  },2000);
</script>
<div style="padding-top:10px;text-align:left;line-height:25px;">
    <table align="center" width="300" bgcolor="#3399ff" cellpadding="1" cellspacing="1">
        <tr bgcolor="#e1f0fd"><td width="95" align="center">提示信息:</td><td><strong style="color:red;">请刷新后,重新搜索!</strong></td></tr>
                <tr bgcolor="#e1f0fd"><td align="center">自动跳转:</td><td><span id="seconds">2</span>秒后自动跳转!</td></tr>
                <tr bgcolor="#e1f0fd"><td colspan="2" align="center"><a href="">立即跳转</a></td>
    </table>
</div>
</body>
</html>

My approach and what I have tried

I tried scrapy.Request with body=json.dumps(data), and I also tried taking the cookies out of the headers, converting them to a dict and passing them separately, but the result is always 400; and with 'Content-Length' commented out, all I get is the error page shown above.
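
For reference, a sketch of the variants mentioned above, under the assumption that case.php expects an application/x-www-form-urlencoded body: body=json.dumps(data) sends JSON, a different format, whereas scrapy.Request with a url-encoded body plus cookies passed as a dict keeps the same shape the requests version sends (the helper name is made up):

import scrapy
from urllib.parse import urlencode


def build_search_request(keyword, callback):
    """Hypothetical helper: the same POST built with scrapy.Request."""
    data = {'m': 'search', 'key': keyword}
    return scrapy.Request(
        url='http://www.biqugse.com/case.php',
        method='POST',
        body=urlencode(data),   # form-encoded body, unlike json.dumps(data)
        headers={'Content-Type': 'application/x-www-form-urlencoded'},
        cookies={                # cookie values taken from the Cookie header in the post
            'obj': '1',
            'PHPSESSID': 'ibjjb23leokjq11k2f24q4rqv7',
        },
        callback=callback,
    )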

