Problem description and background
Recently, while learning web scraping, I have been using the Scrapy framework to write a crawler for novels on biquge (笔趣阁). Since the request needs to carry parameters, I overrode the default first-request method, but no matter how I test it, the response is always a 400 error. If I leave 'Content-Length' out of the headers, the connection succeeds, but what comes back is an error page.
Yet my earlier version without Scrapy, which calls requests.post with the same URL, the same headers, and the same data, gets the correct content.
Here is the scrapy.FormRequest code; running it produces the 400 error:
import scrapy
from urllib.parse import quote

def start_requests(self):
    # Scrapy sends a GET request to every URL in start_urls by default;
    # to POST instead, the parent class's start_requests method must be overridden.
    search_name = input('Enter a keyword for the novel you want to search for: ')
    search_name1 = quote(search_name, encoding='utf-8')  # note: not actually used below
    data = {'m': 'search', 'key': search_name}
    start_urls = ['http://www.biqugse.com/case.php']
    global header
    header = {
        'Cookie': 'obj=1; 796ab53acf966fbacf8f078ecd10a9ce=a%3A1%3A%7Bi%3A552%3Bs%3A29%3A%2234369962%7C%2A%7C%E7%BB%88%E7%AB%A0%E3%80%81%E6%96%B0%E4%B8%96%E7%95%8C%22%3B%7D; PHPSESSID=ibjjb23leokjq11k2f24q4rqv7; ac30dd80c4d7d9d53b73bdd8bb9aaf43=1; Hm_lvt_7a41ef5a4df2b47849f9945ac428a3df=1663060001,1663069368,1663115792,1663392512; Hm_lpvt_7a41ef5a4df2b47849f9945ac428a3df=1663404614',
        'Content-Length': '31',
        # 'Transfer-Encoding': 'chunked',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    }
    for url in start_urls:
        yield scrapy.FormRequest(url=url, headers=header, formdata=data, callback=self.parse)
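One detail that already looks suspicious to me: the hardcoded 'Content-Length': '31' can only be correct when the urlencoded body happens to be exactly 31 bytes, yet the body length changes with the search keyword. A quick check (my own sketch, with a made-up keyword):

from urllib.parse import urlencode

# The POST body is the urlencoded form data, so a hardcoded Content-Length
# of 31 only matches bodies that are exactly 31 bytes long.
data = {'m': 'search', 'key': '诡秘之主'}  # made-up keyword for illustration
body = urlencode(data)
print(body, len(body.encode('utf-8')))  # the length varies with the keyword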
Here is the earlier requests.post version, which returns the page content correctly:
import requests

def search_notes():
    # Search for a novel by name and print the search results.
    name = input('Enter the novel name: ')
    url = 'http://www.biqugse.com/case.php'
    header = {
        'Content-Length': '31',
        'Cookie': 'obj=1; 796ab53acf966fbacf8f078ecd10a9ce=a%3A1%3A%7Bi%3A552%3Bs%3A29%3A%2234369962%7C%2A%7C%E7%BB%88%E7%AB%A0%E3%80%81%E6%96%B0%E4%B8%96%E7%95%8C%22%3B%7D; PHPSESSID=ibjjb23leokjq11k2f24q4rqv7; ac30dd80c4d7d9d53b73bdd8bb9aaf43=1; Hm_lvt_7a41ef5a4df2b47849f9945ac428a3df=1663060001,1663069368,1663115792,1663392512; Hm_lpvt_7a41ef5a4df2b47849f9945ac428a3df=1663404614',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    }
    data = {'m': 'search', 'key': name}
    req = requests.post(url, headers=header, data=data)
    print(req.text)  # added for completeness: show the search-result page
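Since this version works, I can also inspect the request that requests actually prepared and compare it with what Scrapy sends (a diagnostic sketch of mine, reusing url, header, and data from the function above):

import requests

req = requests.post(url, headers=header, data=data)
print(req.request.headers)  # the headers actually sent; worth comparing the Content-Length here with the hardcoded 31
print(req.request.body)     # the urlencoded form body, e.g. 'm=search&key=...'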
Run output and error messages
Output with the Content-Length header left in:
2022-09-17 21:15:41 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://www.biqugse.com/case.php> (referer: None)
2022-09-17 21:15:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.biqugse.com/case.php>: HTTP status code is not handled or not allowed
2022-09-17 21:15:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-17 21:15:41 [scrapy.core.engine] ERROR: Scraper close failure
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\twisted\internet\defer.py", line 891, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "E:\learncode\code\daima\biqu\biqu\pipelines.py", line 29, in close_spider
jar_url=os.path.join(r'E:\learncode\code\daima\biqu',file_name)
NameError: name 'file_name' is not defined
2022-09-17 21:15:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 401,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 296,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'elapsed_time_seconds': 3.882993,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 9, 17, 13, 15, 41, 331610),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/400': 1,
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 9, 17, 13, 15, 37, 448617)}
2022-09-17 21:15:41 [scrapy.core.engine] INFO: Spider closed (finished)
The page returned when Content-Length is commented out (an error page telling me to refresh and search again, which auto-redirects after 2 seconds):
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0"/>
<meta http-equiv="Cache-Control" content="max-age=0"/>
<meta http-equiv="Cache-Control" content="no-transform "/>
<style type="text/css">
body{font-size:13px;}
</style>
</head>
<body>
<script type="text/javascript">
var sec = 2;
var t = setInterval(function(){
sec = sec - 1;
if(sec < 1){
clearInterval(t);
return;
}
document.getElementById('seconds').innerHTML = sec;
},1000);
setTimeout(function(){
window.location.href = "";
},2000);
</script>
<div style="padding-top:10px;text-align:left;line-height:25px;">
<table align="center" width="300" bgcolor="#3399ff" cellpadding="1" cellspacing="1">
<tr bgcolor="#e1f0fd"><td width="95" align="center">提示信息:</td><td><strong style="color:red;">请刷新后,重新搜索!</strong></td></tr>
<tr bgcolor="#e1f0fd"><td align="center">自动跳转:</td><td><span id="seconds">2</span>秒后自动跳转!</td></tr>
<tr bgcolor="#e1f0fd"><td colspan="2" align="center"><a href="">立即跳转</a></td>
</table>
</div>
</body>
</html>
My approach and what I have tried
I have tried the scrapy.Request method with body=json.dumps(data), and I have also tried pulling the cookies out of the headers, converting them into a dict, and passing them separately. The result is still always 400, and with Content-Length commented out these attempts also come back as 400 responses. A rough reconstruction of the two attempts is sketched below.
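Roughly what those two attempts looked like, inside start_requests (a sketch from memory; the cookie dict below is abbreviated, and data and header are the same variables as in the code above):

import json
import scrapy

# Attempt 1 (sketch): POST the payload as a JSON body via scrapy.Request.
yield scrapy.Request(
    url='http://www.biqugse.com/case.php',
    method='POST',
    body=json.dumps(data),  # note: the working requests version sends urlencoded form data, not JSON
    headers=header,
    callback=self.parse,
)

# Attempt 2 (sketch): move the cookies out of the header dict and pass them
# through Scrapy's cookies argument as a dict instead.
cookies = {
    'obj': '1',
    'PHPSESSID': 'ibjjb23leokjq11k2f24q4rqv7',
    # ... remaining name=value pairs from the Cookie header above ...
}
yield scrapy.FormRequest(
    url='http://www.biqugse.com/case.php',
    headers={k: v for k, v in header.items() if k != 'Cookie'},
    cookies=cookies,
    formdata=data,
    callback=self.parse,
)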