Three problems:
1. Your later requests omit the headers, so the site blocks them. Change them to requests.get(xurl, headers=headers).
2. In those later requests, use .content instead of .text to avoid mojibake: .text decodes with whatever encoding requests guesses, while passing the raw bytes lets lxml read the charset declared in the page itself.
3. What you are scraping is plain text; giving the file a .docx extension does not make it a real Word document, so Word may fail to open it, but Notepad (or any text editor) can.
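To see why .text can garble Chinese pages, here is a minimal offline sketch; the GBK-encoded bytes are made up for illustration, standing in for what a server might actually send:

```python
# Bytes as a GBK-encoded server might return them (illustrative)
raw = "教育部".encode("gbk")

# Decoding with the wrong charset garbles the text -- this is what
# .text does when requests guesses the encoding incorrectly
wrong = raw.decode("latin-1")

# Decoding the raw bytes (what .content gives you) with the right
# charset recovers the original text
right = raw.decode("gbk")

print(right)   # 教育部
print(wrong)   # mojibake
```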
The fixed, tested code is below:
#coding: utf-8
import requests
from lxml import etree

url = 'http://www.moe.gov.cn/jyb_xxgk/moe_1777/moe_1778/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'
}

# Fetch the list page and collect the links to each article
response = requests.get(url, headers=headers)
html = etree.HTML(response.content)
result1 = html.xpath('//ul[@id="list"]//li//a/@href')

# Open the output file once, so the pages are appended one after
# another instead of each page overwriting the previous one
fname = r"C:\Users\Administrator\Desktop\1234.docx"
with open(fname, 'wb') as fp:
    for site in result1:
        xurl = 'http://www.moe.gov.cn/jyb_xxgk/moe_1777/moe_1778/' + site
        # Send the headers on every request, or the site blocks us
        req = requests.get(xurl, headers=headers)
        # Pass the raw bytes to lxml so it reads the charset itself
        html2 = etree.HTML(req.content)
        result2 = html2.xpath('//p/text()')
        for i in result2:
            fp.write(i.encode('utf-8'))
            fp.write(b'\r\n')  # must be bytes: the file is opened in 'wb'
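As a quick offline check of the XPath used above, here is a sketch against a hand-written snippet that mimics the list structure; the real page's markup may of course differ:

```python
from lxml import etree

# A static snippet mimicking the list structure on the target page
# (structure assumed for illustration only)
snippet = b"""
<html><body>
<ul id="list">
  <li><a href="./t1.html">Item 1</a></li>
  <li><a href="./t2.html">Item 2</a></li>
</ul>
</body></html>
"""

doc = etree.HTML(snippet)
# Same expression as in the script: all hrefs under the #list ul
links = doc.xpath('//ul[@id="list"]//li//a/@href')
print(links)  # ['./t1.html', './t2.html']
```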
Finally, a friendly reminder: learning the technique is fine, but never use it for illegal purposes. Stay within the law.