我想飞520 2016-01-13 07:14

Python Scrapy spider only scrapes one item. How do I fix this??

Target URL: http://218.92.23.142/sjsz/szxx/Index.aspx (needed for work)
The goal is to scrape each letter's type, subject, date written, date replied, and reply status from the site, plus the full text behind every entry's link, and save everything to an Excel sheet. Every link on the page is submitted via POST; no concrete URL ever appears, which is driving me crazy.
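(For context: this is an ASP.NET WebForms page, so every click is a JavaScript postback that re-POSTs the form's hidden state fields. The way I understand it, each navigation has to be reproduced roughly like the sketch below. FormRequest.from_response is standard Scrapy API that copies the hidden inputs such as __VIEWSTATE from the page's form; the spider name here is made up for illustration.)

from scrapy.spiders import Spider
from scrapy.http import FormRequest

class PostbackSketch(Spider):
    # hypothetical minimal spider, for illustration only
    name = "postback_sketch"
    start_urls = ["http://218.92.23.142/sjsz/szxx/Index.aspx"]

    def parse(self, response):
        # from_response() copies the form's hidden ASP.NET state fields
        # (__VIEWSTATE, __EVENTVALIDATION, ...) into the POST body, so only
        # the postback target has to be filled in by hand.
        yield FormRequest.from_response(
            response,
            formdata={'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1$ctl12$cmdNext',
                      '__EVENTARGUMENT': ''},
            callback=self.parse)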
Problems so far:
1. I only ever capture the first entry; everything after it is missed. Specifically this one: "Hello Mayor: I am a..."
2. The log from the Scrapy run:
2016-01-13 15:01:33 [scrapy] INFO: Scrapy 1.0.3 started (bot: spider2)
2016-01-13 15:01:33 [scrapy] INFO: Optional features available: ssl, http11
2016-01-13 15:01:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spider2.spiders', 'FEED_URI': u'file:///F:/\u5feb\u76d8/workspace/Pythontest/src/Scrapy/spider2/szxx.csv', 'SPIDER_MODULES': ['spider2.spiders'], 'BOT_NAME': 'spider2', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'CSV'}
2016-01-13 15:01:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-13 15:01:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-13 15:01:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-13 15:01:38 [scrapy] INFO: Enabled item pipelines:
2016-01-13 15:01:38 [scrapy] INFO: Spider opened
2016-01-13 15:01:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-13 15:01:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-01-13 15:01:39 [scrapy] DEBUG: Filtered duplicate request: - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Redirecting (302) to from
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
2016-01-13 15:01:39 [scrapy] DEBUG: Scraped from
(data of the first item here; too long, omitted)
2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) (referer: http://218.92.23.142/sjsz/szxx/Index.aspx)
…………
(the rest is similar, so I'll leave it out)
2016-01-13 15:01:41 [scrapy] INFO: Stored csv feed (1 items) in: file:///F:/快盘/workspace/Pythontest/src/Scrapy/spider2/szxx.csv
2016-01-13 15:01:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 56383,
'downloader/request_count': 17,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 14,
'downloader/response_bytes': 118855,
'downloader/response_count': 17,
'downloader/response_status_count/200': 16,
'downloader/response_status_count/302': 1,
'dupefilter/filtered': 120,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 13, 7, 1, 41, 716000),
'item_scraped_count': 1,
'log_count/DEBUG': 20,
'log_count/INFO': 8,
'request_depth_max': 14,
'response_received_count': 16,
'scheduler/dequeued': 17,
'scheduler/dequeued/memory': 17,
'scheduler/enqueued': 17,
'scheduler/enqueued/memory': 17,
'start_time': datetime.datetime(2016, 1, 13, 7, 1, 38, 670000)}
2016-01-13 15:01:41 [scrapy] INFO: Spider closed (finished)
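One line in those stats that puzzles me is 'dupefilter/filtered': 120. As far as I know, Scrapy's default duplicate filter fingerprints each request by method, URL, and body, and silently drops repeats, and every row in my loop issues a GET to the exact same Index.aspx URL. For illustration, the flag involved looks like this (dont_filter is standard Scrapy API; whether it is the right fix here is part of my question):

from scrapy.http import Request

# Identical requests (same method, URL, and body) are normally collapsed
# by the scheduler's duplicate filter. dont_filter=True keeps this request
# even if an equivalent fingerprint has already been seen.
req = Request(
    url="http://218.92.23.142/sjsz/szxx/Index.aspx",
    dont_filter=True,
)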

The code is below (it isn't written well, please go easy on me):
import sys, copy

reload(sys)
sys.setdefaultencoding('utf-8')
sys.path.append("../")

from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
from items import Spider2Item

class Domeszxx(CrawlSpider):
    name = "szxx"
    allowed_domains = ["218.92.23.142"]
    start_urls = ["http://218.92.23.142/sjsz/szxx/Index.aspx"]
    item = Spider2Item()

    def parse(self, response):

        selector = Selector(response)

        # Collect the POST parameters needed to reach the next page
        viewstate = ''.join(selector.xpath('//input[@id="__VIEWSTATE"]/@value').extract()[0])
        eventvalidation = ''.join(selector.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract()[0])
        nextpage = ''.join(
            selector.xpath('//input[@name="ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage"]/@value').extract())
        nextpage_data = {
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1$ctl12$cmdNext',
            '__EVENTARGUMENT': '',
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': '9DEFE542',
            '__EVENTVALIDATION': eventvalidation,
            'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage
        }
        # XPath prefixes for the fields of each row on the current page
        xjlx = ".//*[@id='ContentPlaceHolder1_GridView1_Label2_"       # letter type
        xjzt = ".//*[@id='ContentPlaceHolder1_GridView1_LinkButton5_"  # letter subject
        xxsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label4_"       # date written
        hfsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label5_"       # date replied
        nextlink = '//*[@id="ContentPlaceHolder1_GridView1_cmdNext"]/@href'

        # Number of rows with public replies on the current page
        listnum = len(selector.xpath('//tr')) - 2

        # Scrape each row
        for i in range(0, listnum):
            item_all = {}
            xjlx_xpath = xjlx + str(i) + "']/text()"
            xjzt_xpath = xjzt + str(i) + "']/text()"
            xxsj_xpath = xxsj + str(i) + "']/text()"
            hfsj_xpath = hfsj + str(i) + "']/text()"

            # Letter type
            item_all['xjlx'] = selector.xpath(xjlx_xpath).extract()[0].decode('utf-8').encode('gbk')
            # Letter subject
            item_all['xjzt'] = selector.xpath(xjzt_xpath).extract()[0].decode('utf-8').encode('gbk').replace('\n', '')
            # Date written
            item_all['xxsj'] = selector.xpath(xxsj_xpath).extract()[0].decode('utf-8').encode('gbk')
            # Date replied
            item_all['hfsj'] = selector.xpath(hfsj_xpath).extract()[0].decode('utf-8').encode('gbk')

            # POST parameters for this row's detail page
            eventtarget = 'ctl00$ContentPlaceHolder1$GridView1$ctl0' + str(i + 2) + '$LinkButton5'
            content_data = {
                '__EVENTTARGET': eventtarget,
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': '9DEFE542',
                '__EVENTVALIDATION': eventvalidation,
                'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage
            }
            # Hand the scraped fields over to the next callback
            yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.send_value,
                          meta={'item_all': item_all, 'content_data': content_data})

            # The detail pages can only be reached by POSTing back to the same page; there is
            # no direct URL, so the row's item and the POST data have to travel together.
            # yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.getcontent,
            #               meta={'item': item_all})
            # yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=content_data,
            #                   callback=self.getcontent)

        # Move on to the next page
        if selector.xpath(nextlink).extract():
            yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=nextpage_data,
                              callback=self.parse)

    # Receive the row's fields, stash them on the class-level item, then request the detail page
    def send_value(self, response):
        itemx = response.meta['item_all']
        post_data = response.meta['content_data']
        Domeszxx.item = copy.deepcopy(itemx)
        yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=post_data,
                          callback=self.getcontent)

    # Scrape the detail page and merge the result into the class-level item
    def getcontent(self, response):
        item_getcontent = {
            'xfr': ''.join(response.xpath('//*[@id="lblXFName"]/text()').extract()).decode('utf-8').encode('gbk'),
            'lxnr': ''.join(response.xpath('//*[@id="lblXFQuestion"]/text()').extract()).decode('utf-8').encode('gbk'),
            'hfnr': ''.join(response.xpath('//*[@id="lblXFanswer"]/text()').extract()).decode('utf-8').encode('gbk')}
        Domeszxx.item.update(item_getcontent)
        yield Domeszxx.item
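For completeness, the alternative hinted at in the commented-out lines above would skip the intermediate GET and the shared class attribute by POSTing straight to the detail postback, with the row's fields riding along in meta. A rough sketch, where DetailHandoffSketch, request_detail, parse_detail, and INDEX_URL are my own illustrative names:

from scrapy.http import FormRequest

INDEX_URL = "http://218.92.23.142/sjsz/szxx/Index.aspx"

class DetailHandoffSketch(object):
    # hypothetical mix-in illustrating the meta-based hand-off

    def request_detail(self, item_all, content_data):
        # POST straight to the detail postback; the row's fields travel
        # in meta instead of being stashed on the class.
        return FormRequest(url=INDEX_URL, formdata=content_data,
                           meta={'item_all': item_all},
                           callback=self.parse_detail)

    def parse_detail(self, response):
        item = dict(response.meta['item_all'])
        item['xfr'] = ''.join(response.xpath('//*[@id="lblXFName"]/text()').extract())
        item['lxnr'] = ''.join(response.xpath('//*[@id="lblXFQuestion"]/text()').extract())
        item['hfnr'] = ''.join(response.xpath('//*[@id="lblXFanswer"]/text()').extract())
        yield item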