Python string.replace()
```
#coding=utf-8

import re
from bs4 import BeautifulSoup as BS
import requests
import hackhttp

# BeautifulSoup

url = 'https://www.douyu.com/directory/game/LOL'
r = requests.get(url, verify=False)
html = r.content

soup = BS(html, 'lxml')
bbs = soup.find_all(name='h3', attrs={'class': 'ellipsis'})
print bbs
for news in bbs:
    print news.string.replace('\r', '').replace('\n', '')
```

Result:

```
Traceback (most recent call last):
  File "spider.py", line 18, in <module>
    print news.string.replace('\r','').replace('\n','')
AttributeError: 'NoneType' object has no attribute 'replace'
```
3 Answers

The content pulled out of news is None. When you extract the h3 tags, some tags that don't carry a usable string get picked up as well. Skip those and it works:

```
for news in bbs:
    if news.string is not None:
        print(news.string.replace('\r', '').replace('\n', ''))
```
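If you would rather keep the entries whose .string is None (typically h3 tags that wrap extra markup) instead of skipping them, a small sketch using get_text() flattens the descendant text. It reuses the bbs list from the question above; strip=True is just one way to trim surrounding whitespace:

```
# get_text() concatenates every descendant string, so it still returns
# text when .string is None because the tag has child elements.
for news in bbs:
    text = news.get_text(strip=True)   # '' when the tag is genuinely empty
    if text:
        print(text.replace('\r', '').replace('\n', ''))
```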

zy820: Thanks. — replied almost 3 years ago

First check whether news.string is None; if it is, no data was retrieved for that element.

zy820 (replying to oyljerry): Right, the first few are None. — replied almost 3 years ago
oyljerry (replying to zy820): When the code errors out, there must have been a case where no data came back. — replied almost 3 years ago
zy820: There is data. — replied almost 3 years ago
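For reference, this matches how BeautifulSoup defines .string: it is only populated when a tag holds a single string child, so any h3.ellipsis that wraps extra markup comes back as None. A minimal, self-contained sketch (the two h3 snippets below are made up for illustration):

```
# coding=utf-8
from bs4 import BeautifulSoup

snippet = """
<h3 class="ellipsis">plain title</h3>
<h3 class="ellipsis"><span>hot</span> nested title</h3>
"""

soup = BeautifulSoup(snippet, 'lxml')
for h3 in soup.find_all('h3', class_='ellipsis'):
    # .string is set only when the tag holds exactly one string child;
    # the second h3 has a <span> plus trailing text, so .string is None.
    print(repr(h3.string))
```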

What is that last piece of code actually meant to do?

:arg index: An index or iterable thereof to delete If the index is not found, raise :class:`~pyelasticsearch.exceptions.ElasticHttpNotFoundError`. See `ES's delete-index API`_ for more detail. .. _`ES's delete-index API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-delete-index.html """ if not index: raise ValueError('No indexes specified. To delete all indexes, use' ' delete_all_indexes().') return self.send_request('DELETE', [self._concat(index)], query_params=query_params) def delete_all_indexes(self, **kwargs): """Delete all indexes.""" return self.delete_index('_all', **kwargs) @es_kwargs() def close_index(self, index, query_params=None): """ Close an index. :arg index: The index to close See `ES's close-index API`_ for more detail. .. _`ES's close-index API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-open-close.html """ return self.send_request('POST', [index, '_close'], query_params=query_params) @es_kwargs() def open_index(self, index, query_params=None): """ Open an index. :arg index: The index to open See `ES's open-index API`_ for more detail. .. _`ES's open-index API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-open-close.html """ return self.send_request('POST', [index, '_open'], query_params=query_params) @es_kwargs() def get_settings(self, index, query_params=None): """ Get the settings of one or more indexes. :arg index: An index or iterable of indexes See `ES's get-settings API`_ for more detail. .. _`ES's get-settings API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-get-settings.html """ return self.send_request('GET', [self._concat(index), '_settings'], query_params=query_params) @es_kwargs() def update_settings(self, index, settings, query_params=None): """ Change the settings of one or more indexes. :arg index: An index or iterable of indexes :arg settings: A dictionary of settings See `ES's update-settings API`_ for more detail. .. _`ES's update-settings API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html """ if not index: raise ValueError('No indexes specified. To update all indexes, use' ' update_all_settings().') # If we implement the "update cluster settings" API, call that # update_cluster_settings(). return self.send_request('PUT', [self._concat(index), '_settings'], body=settings, query_params=query_params) @es_kwargs() def update_all_settings(self, settings, query_params=None): """ Update the settings of all indexes. :arg settings: A dictionary of settings See `ES's update-settings API`_ for more detail. .. _`ES's update-settings API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html """ return self.send_request('PUT', ['_settings'], body=settings, query_params=query_params) @es_kwargs('refresh') def flush(self, index=None, query_params=None): """ Flush one or more indices (clear memory). :arg index: An index or iterable of indexes See `ES's flush API`_ for more detail. .. _`ES's flush API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-flush.html """ return self.send_request('POST', [self._concat(index), '_flush'], query_params=query_params) @es_kwargs() def refresh(self, index=None, query_params=None): """ Refresh one or more indices. :arg index: An index or iterable of indexes See `ES's refresh API`_ for more detail. .. 
_`ES's refresh API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html """ return self.send_request('POST', [self._concat(index), '_refresh'], query_params=query_params) @es_kwargs() def gateway_snapshot(self, index=None, query_params=None): """ Gateway snapshot one or more indices. :arg index: An index or iterable of indexes See `ES's gateway-snapshot API`_ for more detail. .. _`ES's gateway-snapshot API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-gateway-snapshot.html """ return self.send_request( 'POST', [self._concat(index), '_gateway', 'snapshot'], query_params=query_params) @es_kwargs('max_num_segments', 'only_expunge_deletes', 'refresh', 'flush', 'wait_for_merge') def optimize(self, index=None, query_params=None): """ Optimize one or more indices. :arg index: An index or iterable of indexes See `ES's optimize API`_ for more detail. .. _`ES's optimize API`: http://www.elasticsearch.org/guide/reference/api/admin-indices-optimize.html """ return self.send_request('POST', [self._concat(index), '_optimize'], query_params=query_params) @es_kwargs('level', 'wait_for_status', 'wait_for_relocating_shards', 'wait_for_nodes', 'timeout') def health(self, index=None, query_params=None): """ Report on the health of the cluster or certain indices. :arg index: The index or iterable of indexes to examine See `ES's cluster-health API`_ for more detail. .. _`ES's cluster-health API`: http://www.elasticsearch.org/guide/reference/api/admin-cluster-health.html """ return self.send_request( 'GET', ['_cluster', 'health', self._concat(index)], query_params=query_params) @es_kwargs('filter_nodes', 'filter_routing_table', 'filter_metadata', 'filter_blocks', 'filter_indices') def cluster_state(self, query_params=None): """ The cluster state API allows to get comprehensive state information of the whole cluster. (Insert es_kwargs here.) See `ES's cluster-state API`_ for more detail. .. _`ES's cluster-state API`: http://www.elasticsearch.org/guide/reference/api/admin-cluster-state.html """ return self.send_request( 'GET', ['_cluster', 'state'], query_params=query_params) @es_kwargs() def percolate(self, index, doc_type, doc, query_params=None): """ Run a JSON document through the registered percolator queries, and return which ones match. :arg index: The name of the index to which the document pretends to belong :arg doc_type: The type the document should be treated as if it has :arg doc: A Python mapping object, convertible to JSON, representing the document Use :meth:`index()` to register percolators. See `ES's percolate API`_ for more detail. .. _`ES's percolate API`: http://www.elasticsearch.org/guide/reference/api/percolate/ """ return self.send_request('GET', [index, doc_type, '_percolate'], doc, query_params=query_params) class JsonEncoder(json.JSONEncoder): def default(self, value): """Convert more Python data types to ES-understandable JSON.""" iso = _iso_datetime(value) if iso: return iso if not PY3 and isinstance(value, str): return unicode(value, errors='replace') # TODO: Be stricter. if isinstance(value, set): return list(value) return super(JsonEncoder, self).default(value) def _iso_datetime(value): """ If value appears to be something datetime-like, return it in ISO format. Otherwise, return None. """ if hasattr(value, 'strftime'): if hasattr(value, 'hour'): return value.isoformat() else: return '%sT00:00:00' % value.isoformat()
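The listing above reads more easily next to a concrete call sequence. The sketch below is not part of the original post; it assumes the methods belong to pyelasticsearch's `ElasticSearch` client class (which this listing appears to be excerpted from) and that an Elasticsearch node is reachable on localhost:9200. The index and document names are made up for illustration.

```
# Minimal usage sketch for the client methods shown above (assumed to be
# pyelasticsearch.ElasticSearch; adjust the URL to your own cluster).
from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://localhost:9200/')

# index() stores one document; bulk_index() batches many, reading each
# document's ID from id_field as in the method body above.
es.index('blog', 'post', {'title': 'first post'}, id=1)
es.bulk_index('blog', 'post', [
    {'id': 2, 'title': 'second post'},
    {'id': 3, 'title': 'third post'},
], id_field='id')

# search() and count() accept either a query-DSL dict or a plain query string,
# exactly as _search_or_count() branches on the query type.
hits = es.search({'query': {'match': {'title': 'post'}}}, index='blog')
total = es.count('title:post', index='blog')

# delete() requires an explicit ID; refresh() makes the changes searchable.
es.delete('blog', 'post', 1)
es.refresh('blog')
```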
A Douban spider downloaded from GitHub only prints "Process finished with exit code 0" — please help
I don't know much about web crawlers. I'm running Douban spider code that someone else uploaded; it produces no results and no error messages. A Baidu search suggests it's because nothing writes the output, but I've racked my brains and still can't write that part myself. If you see this and happen to have time, please help me out. The code is as follows:
```
import scrapy
import sys
import re
import random
# from douban.items import UserItem
from douban.items import DoubanItem


class CollectSpider(scrapy.Spider):
    name = 'collect_spider'
    user_ids = [121253425]
    crawl_types = ['collect_movie']
    douban_types = ['collect_movie', 'wise_movie']

    # URL templates for movies watched / wish to watch / currently watching
    collect_movie_tpl = 'https://movie.douban.com/people/vividtime/collect?start={}'
    wish_movie_tpl = 'https://movie.douban.com/people/vividtime/wish?start={}'
    do_movie_tpl = 'https://movie.douban.com/people/vividtime/do?start={}'
    douban_url_templates = {
        'collect_movie': collect_movie_tpl,
        'wish_movie': wish_movie_tpl,
        'do_movie': do_movie_tpl,
    }

    cookies_list = []

    def get_random_cookies(self):
        """ get a random cookies in cookies_list """
        cookies = random.choice(self.cookies_list)
        rt = {}
        for item in cookies.split(';'):
            key, value = item.split('=')[0].strip(), item.split('=')[1].strip()
            rt[key] = value
        return rt

    def start_requests(self):
        """ entry of this spider """
        for user_id in self.user_ids:
            for douban_type in self.douban_types:
                if douban_type in self.crawl_types:
                    url_template = self.douban_url_templates[douban_type]
                    if url_template:
                        request_url = url_template.format(user_id, 0) + '&'
                    else:
                        raise Exception('Wrong douban_type: %s' % douban_type)
                    meta = {
                        'url_template': url_template,
                        'douban_type': douban_type,
                        'user_id': user_id
                    }
                    yield scrapy.Request(url=request_url, meta=meta,
                                         callback=self.parse_first_page)

    def parse_first_page(self, response):
        """ parse first page to get total pages and generate more requests """
        meta = response.meta
        total_page = response.xpath('//*[@id="content"]//div[@class="paginator"]/a[last()]/text()').extract()
        total_page = int(total_page[0]) if total_page else 1
        for page in range(total_page):
            request_url = meta['url_template'].format(meta['user_id'], 15 * page)
            yield scrapy.Request(url=request_url, meta=meta, callback=self.parse_content)

    def parse_content(self, response):
        """ parse collect item of response """
        meta = response.meta
        item_list = response.xpath('//*[@id="content"]//div[@class="grid-view"]/div')
        for item in item_list:
            douban_item = DoubanItem()
            douban_item['item_type'] = meta['douban_type']
            douban_item['user_id'] = meta['user_id']

            item_id = item.xpath('div[@class="info"]/ul/li[@class="title"]/a/@href').extract()
            douban_item['item_id'] = int(item_id[0].split('/')[-2]) if item_id else None

            item_name = item.xpath('div[@class="info"]/ul/li[@class="title"]/a/em/text()').extract()
            douban_item['item_name'] = item_name[0].strip() if item_name else None

            item_other_name = item.xpath('div[@class="info"]/ul/li[@class="title"]/a').xpath('string(.)').extract()
            douban_item['item_other_name'] = item_other_name[0].replace(douban_item['item_name'], '').strip() if item_other_name else None

            item_intro = item.xpath('div[@class="info"]/ul/li[@class="intro"]/text()').extract()
            douban_item['item_intro'] = item_intro[0].strip() if item_intro else None

            item_rating = item.xpath('div[@class="info"]/ul/li[3]/*[starts-with(@class, "rating")]/@class').extract()
            douban_item['item_rating'] = int(item_rating[0][6]) if item_rating else None

            item_date = item.xpath('div[@class="info"]/ul/li[3]/*[@class="date"]/text()').extract()
            douban_item['item_date'] = item_date[0] if item_date else None

            item_tags = item.xpath('div[@class="info"]/ul/li[3]/*[@class="tags"]/text()').extract()
            douban_item['item_tags'] = item_tags[0].replace(u'标签: ', '') if item_tags else None

            item_comment = item.xpath('div[@class="info"]/ul/li[4]/*[@class="comment"]/text()').extract()
            douban_item['item_comment'] = item_comment[0] if item_comment else None

            item_poster_id = item.xpath('div[@class="pic"]/a[1]/img[1]/@src').extract()
            douban_item['item_poster_id'] = int(item_poster_id[0].split('/')[-1].split('.')[0][1:]) if item_poster_id else None

            yield douban_item
```
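Not an answer from the original thread, just a sketch of the usual first steps when a Scrapy spider exits cleanly but yields nothing: export the yielded items with Scrapy's built-in feed exporter and check the log for filtered or rejected requests. The settings below are standard Scrapy settings; the spider name is taken from the question, and the assumption that Douban's robots.txt or default User-Agent is blocking the requests is a guess, not a confirmed diagnosis.

```
# settings.py — minimal tweaks that usually reveal why a Douban spider
# produces no items (standard Scrapy settings; values are illustrative).

ROBOTSTXT_OBEY = False   # with the default True, robots.txt may filter every request
LOG_LEVEL = 'INFO'       # watch the "Crawled"/"Filtered"/"Ignoring response" lines
DEFAULT_REQUEST_HEADERS = {
    # Douban tends to reject Scrapy's default User-Agent, so send a browser-like one.
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36'),
}

# Then run the spider with an output feed so the yielded DoubanItem objects
# are written somewhere visible instead of being silently discarded:
#
#   scrapy crawl collect_spider -o collect.json
#
# If collect.json stays empty, the log should show whether requests were
# filtered by robots.txt, redirected to a login page, or answered with 403.
```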