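My girlfriend wants to learn JavaScript but has no internet access, so she asked me to grab the relevant pages of an online tutorial for her. That is how the code below came about: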
import urllib2
import urllib
import re
from sgmllib import SGMLParser


class URLLister(SGMLParser):
    # Collects the href of every <a> tag fed to the parser.
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)


js_root_url = "http://www.w3school.com.cn/js/"
#ep_root_url = "http://www.w3school.com.cn"
index_url = "index.asp"

f = urllib2.urlopen(js_root_url + index_url)

# Save a local copy of the index page.
webfile = urllib.urlopen(js_root_url + index_url).read()
fp = file('index.asp', 'w+')
fp.write(webfile)
fp.close()

if f.code == 200:
    # Extract every link from the index page.
    parser = URLLister()
    parser.feed(f.read())
    f.close()

    #url_pattern = re.compile(r'(^/js/js_|^/tiy/)\D*')
    url_js_pattern = re.compile(r'^/js/js\D*')
    #url_example_pattern = re.compile(r'^/tiy/\D*')
    url_sub_js_pattern = re.compile(r'^/js/js')

    for url in parser.urls:
        if url_js_pattern.search(url):
            # Strip the leading "/js/" so the result works both as a
            # URL relative to js_root_url and as a local file name.
            url = url_sub_js_pattern.sub('js', url)
            webfile = urllib.urlopen(js_root_url + url).read()
            fp = file(url, 'w+')
            fp.write(webfile)
            fp.close()
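The script above is Python 2: sgmllib no longer exists in Python 3, and urllib2 was merged into urllib.request. As a rough sketch of how the link-collecting step might look on Python 3, the snippet below uses the standard html.parser module. The names LinkCollector and tutorial_links and the sample HTML are made up for illustration and are not part of the original script.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical stand-in for URLLister)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def tutorial_links(html):
    """Return only the hrefs that start with /js/js, as the original filter does."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    pattern = re.compile(r"^/js/js")
    return [u for u in collector.urls if pattern.search(u)]


# Made-up sample markup, only to show the filter at work.
sample = ('<a href="/js/js_intro.asp">Intro</a> '
          '<a href="/tiy/t.asp">Try</a> '
          '<a name="top">x</a>')
print(tutorial_links(sample))
```

This only covers parsing and filtering. Downloading would still have to be done separately, for example with urllib.request.urlopen.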
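There are still problems, though. The most obvious one is that clicking a hyperlink on the index page does not open the downloaded first-level page.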
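One likely cause (an assumption, not checked against the live site) is that the saved index page still contains site-absolute links such as /js/js_intro.asp, while the script stores each page under a name like js_intro.asp in the working directory. A possible fix is to rewrite those links before saving the page. The sketch below shows the idea. The names LOCAL_HREF and localize_links and the sample HTML are hypothetical.

```python
import re

# Matches href="/js/js..." or href='/js/js...'.
# Group 1 is the href= prefix including the opening quote,
# group 2 is the file name without the leading "/js/".
LOCAL_HREF = re.compile(r"""(href\s*=\s*["'])/js/(js[^"']*)""", re.IGNORECASE)


def localize_links(html):
    """Turn site-absolute /js/js... links into relative ones (js...)."""
    return LOCAL_HREF.sub(r"\1\2", html)


# Made-up sample markup: the first link is rewritten, the second is left alone.
page = '<a href="/js/js_intro.asp">Intro</a> <a href="/tiy/t.asp">Try</a>'
print(localize_links(page))
```

In the script above, this would mean passing webfile through localize_links just before each fp.write(webfile) call. Links to anything that was not downloaded, such as the /tiy/ examples, would still be broken offline.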