Target URL: http://218.92.23.142/sjsz/szxx/Index.aspx (this is for work).
The goal is to scrape each letter's type, subject, submission time, reply time, and reply status, plus the full content behind each letter's link, and save it all to an Excel file. Every link on the page is driven by a POST request; no concrete URL ever appears, which is extremely frustrating.
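For context: each of those links is an ASP.NET postback. The anchor calls the JavaScript helper `__doPostBack(target, argument)`, which fills two hidden fields and submits the page's single form, and the server tells the rows apart only by the hidden `__EVENTTARGET` value. So instead of following a URL, a crawler has to replay the form. A sketch of the POST body such a link submits (control names taken from this page; the state values are placeholders that must be copied from the page just received):

```python
# Shape of the POST body an ASP.NET __doPostBack link submits.
# __VIEWSTATE and __EVENTVALIDATION must be copied from the page
# that was just received; the values below are placeholders only.
postback_body = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1$ctl02$LinkButton5',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': '<copied from the current page>',
    '__EVENTVALIDATION': '<copied from the current page>',
}
print(sorted(postback_body))
```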
Problems so far:
1. I can only scrape the first record; everything after it is missed. Specifically this one: 市长您好: 我是一名事... ("Hello Mayor, I am a ...")
2. This is what Scrapy prints when it runs:
2016-01-13 15:01:33 [scrapy] INFO: Scrapy 1.0.3 started (bot: spider2)
2016-01-13 15:01:33 [scrapy] INFO: Optional features available: ssl, http11
2016-01-13 15:01:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spider2.spiders', 'FEED_URI': u'file:///F:/\u5feb\u76d8/workspace/Pythontest/src/Scrapy/spider2/szxx.csv', 'SPIDER_MODULES': ['spider2.spiders'], 'BOT_NAME': 'spider2', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'CSV'}
2016-01-13 15:01:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-13 15:01:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-13 15:01:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-13 15:01:38 [scrapy] INFO: Enabled item pipelines:
2016-01-13 15:01:38 [scrapy] INFO: Spider opened
2016-01-13 15:01:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-13 15:01:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-01-13 15:01:39 [scrapy] DEBUG: Filtered duplicate request: - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Redirecting (302) to from
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Scraped from
(the first record's content; too long, omitted here)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
…… (the remaining lines are similar, omitted)
2016-01-13 15:01:41 [scrapy] INFO: Stored csv feed (1 items) in: file:///F:/快盘/workspace/Pythontest/src/Scrapy/spider2/szxx.csv
2016-01-13 15:01:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 56383,
'downloader/request_count': 17,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 14,
'downloader/response_bytes': 118855,
'downloader/response_count': 17,
'downloader/response_status_count/200': 16,
'downloader/response_status_count/302': 1,
'dupefilter/filtered': 120,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 13, 7, 1, 41, 716000),
'item_scraped_count': 1,
'log_count/DEBUG': 20,
'log_count/INFO': 8,
'request_depth_max': 14,
'response_received_count': 16,
'scheduler/dequeued': 17,
'scheduler/dequeued/memory': 17,
'scheduler/enqueued': 17,
'scheduler/enqueued/memory': 17,
'start_time': datetime.datetime(2016, 1, 13, 7, 1, 38, 670000)}
2016-01-13 15:01:41 [scrapy] INFO: Spider closed (finished)
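Note the line `Filtered duplicate request` and the stat `'dupefilter/filtered': 120` above. Scrapy's default dupefilter fingerprints a request from its method, URL, and body only; `meta` and the callback are ignored. A simplified model of that fingerprinting (not Scrapy's actual implementation, which also canonicalizes the URL) shows why repeated GETs to the same Index.aspx are treated as duplicates even when their `meta` differs:

```python
import hashlib

def request_fingerprint(method, url, body=b''):
    """Simplified model of Scrapy's request fingerprint: only the
    method, URL, and body are hashed. meta and callback play no
    part, so two requests differing only in meta are 'duplicates'."""
    h = hashlib.sha1()
    h.update(method.encode('utf-8'))
    h.update(url.encode('utf-8'))
    h.update(body)
    return h.hexdigest()

url = "http://218.92.23.142/sjsz/szxx/Index.aspx"
fp_row1 = request_fingerprint("GET", url)  # first row's Request
fp_row2 = request_fingerprint("GET", url)  # second row's Request, different meta
print(fp_row1 == fp_row2)  # True -> the second request is filtered out
```

This matches the symptom: only the first record survives, and the rest are dropped before they are ever downloaded.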
The code is below (it is not written well; please go easy on me):
import sys, copy
reload(sys)
sys.setdefaultencoding('utf-8')
sys.path.append("../")
from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
from items import Spider2Item


class Domeszxx(CrawlSpider):
    name = "szxx"
    allowed_domain = ["218.92.23.142"]
    start_urls = ["http://218.92.23.142/sjsz/szxx/Index.aspx"]
    item = Spider2Item()

    def parse(self, response):
        selector = Selector(response)
        # Collect the POST parameters needed to reach the next page
        viewstate = ''.join(selector.xpath('//input[@id="__VIEWSTATE"]/@value').extract()[0])
        eventvalidation = ''.join(selector.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract()[0])
        nextpage = ''.join(
            selector.xpath('//input[@name="ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage"]/@value').extract())
        nextpage_data = {
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1$ctl12$cmdNext',
            '__EVENTARGUMENT': '',
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': '9DEFE542',
            '__EVENTVALIDATION': eventvalidation,
            'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage
        }
        # XPath prefixes for the fields of each row on the current page
        xjlx = ".//*[@id='ContentPlaceHolder1_GridView1_Label2_"
        xjzt = ".//*[@id='ContentPlaceHolder1_GridView1_LinkButton5_"
        xxsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label4_"
        hfsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label5_"
        nextlink = '//*[@id="ContentPlaceHolder1_GridView1_cmdNext"]/@href'
        # Number of published replies on the current page
        listnum = len(selector.xpath('//tr')) - 2
        # Scrape every row
        for i in range(0, listnum):
            item_all = {}
            xjlx_xpath = xjlx + str(i) + "']/text()"
            xjzt_xpath = xjzt + str(i) + "']/text()"
            xxsj_xpath = xxsj + str(i) + "']/text()"
            hfsj_xpath = hfsj + str(i) + "']/text()"
            # Letter type
            item_all['xjlx'] = selector.xpath(xjlx_xpath).extract()[0].decode('utf-8').encode('gbk')
            # Letter subject
            item_all['xjzt'] = str(selector.xpath(xjzt_xpath).extract()[0].decode('utf-8').encode('gbk')).replace('\n', '')
            # Submission time
            item_all['xxsj'] = selector.xpath(xxsj_xpath).extract()[0].decode('utf-8').encode('gbk')
            # Reply time
            item_all['hfsj'] = selector.xpath(hfsj_xpath).extract()[0].decode('utf-8').encode('gbk')
            # POST parameters for the detail (second-level) page
            eventtaget = 'ctl00$ContentPlaceHolder1$GridView1$ctl0' + str(i + 2) + '$LinkButton5'
            content_data = {
                '__EVENTTARGET': eventtaget,
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': '9DEFE542',
                '__EVENTVALIDATION': eventvalidation,
                'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage
            }
            # Hand the scraped fields over to the next callback
            yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.send_value,
                          meta={'item_all': item_all, 'content_data': content_data})
            # The detail pages can only be reached via POST; there is no direct URL.
            # Pass along both the item scraped on this page and the POST data for the detail page.
            # yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.getcontent,
            #               meta={'item': item_all})
            # yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=content_data,
            #                   callback=self.getcontent)
        # Move on to the next page
        if selector.xpath(nextlink).extract():
            yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=nextpage_data,
                              callback=self.parse)

    # Receive the values from the list page and store them in the class-level item
    def send_value(self, response):
        itemx = response.meta['item_all']
        post_data = response.meta['content_data']
        Domeszxx.item = copy.deepcopy(itemx)
        yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=post_data,
                          callback=self.getcontent)
        return

    # Scrape the detail page and merge its fields into the class-level item
    def getcontent(self, response):
        item_getcontent = {
            'xfr': ''.join(response.xpath('//*[@id="lblXFName"]/text()').extract()).decode('utf-8').encode('gbk'),
            'lxnr': ''.join(response.xpath('//*[@id="lblXFQuestion"]/text()').extract()).decode('utf-8').encode('gbk'),
            'hfnr': ''.join(response.xpath('//*[@id="lblXFanswer"]/text()').extract()).decode('utf-8').encode('gbk')}
        Domeszxx.item.update(item_getcontent)
        yield Domeszxx.item
        return
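A separate pitfall lurks in the `__EVENTTARGET` construction above. Assuming the usual auto-generated ASP.NET GridView IDs, the control index is zero-padded to two digits (ctl02, ctl03, ..., ctl09, ctl10, ctl11), so plain concatenation of `'ctl0' + str(i + 2)` produces the wrong name (`ctl010`) as soon as `i + 2` reaches 10. A padded format string avoids that:

```python
def grid_row_target(i):
    """Build the ASP.NET __EVENTTARGET for GridView data row i.
    Assumes standard auto-generated control IDs, where the row
    index is zero-padded to two digits (ctl02 .. ctl11), so
    'ctl0' + str(i + 2) would yield 'ctl010' instead of 'ctl10'."""
    return 'ctl00$ContentPlaceHolder1$GridView1$ctl%02d$LinkButton5' % (i + 2)

print(grid_row_target(0))  # ...$ctl02$LinkButton5
print(grid_row_target(8))  # ...$ctl10$LinkButton5, not ctl010
```

With ten rows per page, the string-concatenation version would therefore mis-target the last rows even once the duplicate-request problem is solved.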