HtOmArthas 2016-12-03 08:44
[Resolved]

Python crawler hangs when scraping a large number of web pages

I'm using Python to crawl Sina News articles. After roughly 260 articles, the program gets stuck at the regular-expression matching step. Fetching that same page on its own works fine. What could be the cause?
import urllib2
import re
import time
import datetime

def getContent(url, index):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    myRequest = urllib2.Request(url, headers=headers)
    for tries in range(5):
        try:
            myResponse = urllib2.urlopen(myRequest)
            myPage = myResponse.read()
            time.sleep(1)
            break
        except:
            # NOTE: tries runs 0..4, so this condition is always true and
            # 'connection failed' is unreachable; on total failure myPage
            # is never assigned and the findall below raises NameError.
            if tries < 5:
                print 'Attempt to reconnect'
                continue
            else:
                print 'connection failed'

    textContent = re.findall('<div class="article article_16" id=".*?">(.*?)<p class="article-editor">', myPage, re.S)
    # NOTE: if a page never matches the pattern, findall keeps returning []
    # on the same unchanged string and this loop spins forever -- the most
    # likely place the crawler appears "stuck at the regex".
    while len(textContent) == 0:
        textContent = re.findall('<div class="article article_16" id=".*?">(.*?)<p class="article-editor">', myPage, re.S)
    retContent = re.findall('<p.{0,100}?>(.*?)</p>', textContent[0], re.S)
    while len(retContent) == 0:  # same unbounded retry as above
        retContent = re.findall('<p.{0,100}?>(.*?)</p>', textContent[0], re.S)
    #print len(retContent)
    outFile = open(str(index) + '.txt', 'w')  # renamed from `file`, which shadows the builtin
    for text in retContent:
        outFile.write(text)
    outFile.close()
    print str(index) + '.txt has been written'

# build the list of roll-news URLs for each date between start and end
def crawler_urllist(start, end, typeNum):  # all three parameters are strings
    partialUrl = ['http://roll.news.sina.com.cn/interface/rollnews_ch_out_interface.php?col=',
                  '&spec=&type=&date=',
                  '&ch=01&k=&offset_page=0&offset_num=0&num=40&asc=&page=']
    dateFormat = "%Y-%m-%d"  # renamed from `format`, which shadows the builtin
    beginDate = datetime.datetime.strptime(start, dateFormat)
    endDate = datetime.datetime.strptime(end, dateFormat)
    oneDay = datetime.timedelta(days=1)
    diffDay = endDate - beginDate
    dateList = []
    for i in range(diffDay.days + 1):
        dateList.append(datetime.datetime.strftime(beginDate, dateFormat))
        beginDate += oneDay
    retList = []
    for i in range(len(dateList)):
        url = partialUrl[0] + typeNum + partialUrl[1] + dateList[i] + partialUrl[2]
        page = 0
        while True:
            page += 1
            # NOTE: this call expects a fetch helper returning
            # (continueFlag, pageContent); the getContent above takes
            # (url, index) and returns nothing, so the two halves of
            # the posted code are out of sync.
            continueFlag, myPage = getContent(url + str(page))
            print url + str(page)
            if continueFlag:
                retList.append(myPage)
                time.sleep(1)
            else:
                break
    date_url_list = open(typeNum + '.txt', 'w')
    for item in retList:
        date_url_list.write(item)
        date_url_list.write('\n')
    date_url_list.close()
    return typeNum + '.txt'  # was `type + '.txt'`: `type` is the builtin, not a parameter

i = 1
urlFile = open('DomesticNewsUrlList.txt', 'r')
naviurlList = urlFile.readlines()
urlFile.close()
urlList = list(set(naviurlList))[0].split()  # the file holds the URLs on one whitespace-separated line
#urllist1 = urlList[0:5]
for url in urlList:
    getContent(url, i)
    #time.sleep(1)
    i += 1
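One way to confirm exactly where the process is spinning, assuming a Unix-like system, is to install a purely diagnostic stack-dump handler near the top of the script (dump_stack is an illustrative name, not part of the crawler):

import signal
import traceback

# Print the interpreter's current stack when the process receives
# SIGQUIT (Ctrl-\ in most terminals), without killing the process.
def dump_stack(signum, frame):
    traceback.print_stack(frame)

signal.signal(signal.SIGQUIT, dump_stack)

While the crawler is hung, pressing Ctrl-\ in its terminal prints the stack, which shows whether it is sitting in the re.findall retry loop or blocked inside urlopen.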




2 answers

  • mega4221 2016-12-03 10:59

Try adding a timeout to the request, increase the sleep interval a bit, and put a try around the regex-matching part.
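A minimal sketch of those three suggestions, assuming Python 2 and urllib2 as in the question (fetch and extract_paragraphs are illustrative names, not from the original code):

import urllib2
import re
import time

def fetch(url, timeout=10, retries=5, delay=2):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    request = urllib2.Request(url, headers=headers)
    for attempt in range(retries):
        try:
            # the timeout keeps a stalled connection from blocking forever
            response = urllib2.urlopen(request, timeout=timeout)
            return response.read()
        except IOError:  # covers urllib2.URLError and socket.timeout
            print 'attempt %d failed, retrying' % (attempt + 1)
            time.sleep(delay)  # back off a little between attempts
    return None

def extract_paragraphs(page):
    try:
        body = re.findall('<div class="article article_16" id=".*?">(.*?)<p class="article-editor">', page, re.S)
        if not body:
            return []  # page did not match: give up instead of looping forever
        return re.findall('<p.{0,100}?>(.*?)</p>', body[0], re.S)
    except re.error:
        return []

The key changes are the timeout argument to urlopen, a bounded number of retries, and returning an empty result when a page does not match, instead of re-running the same pattern on the same string in an endless loop.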

