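I had never used Python before and only got the software (Python 3.10 and PyCharm 2022) installed yesterday. I want to try extracting tables from a PDF file with Python. My PDF is the kind you can copy text from, not a scan, and the tables start on page 4. I found a lot of how-to posts online and tried every one that looked suitable, but after a whole day of running them all I got was bugs, sob.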
This is the code using camelot:
import camelot
tables = camelot.read_pdf('C:\Users\M\Desktop\0.pdf', pages='4', flavor='stream') # extract the tables from the PDF file
tables[0].to_csv('C:\Users\M\Desktop\0.csv') # save the table data as a CSV file
And this is the result:
File "C:\Users\M\PycharmProjects\smooth\table.py", line 2
tables = camelot.read_pdf('C:\Users\M\Desktop\0.pdf', pages='4', flavor='stream')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Process finished with exit code 1
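A note on this error, as a minimal sketch rather than a definitive fix: in an ordinary string literal, `\U` starts an eight-hex-digit Unicode escape, so `'C:\Users\...'` fails to parse before camelot is ever called. The `\0` in `Desktop\0.pdf` would also be read as a NUL character. Writing the path as a raw string, with forward slashes, or with doubled backslashes avoids both. The variable names below are only illustrative.

```python
from pathlib import PureWindowsPath

# Raw string: every backslash is kept literally, so '\U' and '\0' are not escapes.
raw_path = r'C:\Users\M\Desktop\0.pdf'

# Forward slashes are also accepted in Windows paths.
slash_path = 'C:/Users/M/Desktop/0.pdf'

# Doubling each backslash produces the same string as the raw version.
escaped_path = 'C:\\Users\\M\\Desktop\\0.pdf'

same_string = raw_path == escaped_path
same_location = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
file_name = PureWindowsPath(raw_path).name
print(same_string, same_location, file_name)

# The call from the post would then become (not run here, it needs camelot and the PDF):
# tables = camelot.read_pdf(raw_path, pages='4', flavor='stream')
```

The same change applies to the output path passed to `to_csv`.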
I also tried pdfplumber, and nearly every attempt ended the same way, or printed only the last line of this output. One of those scripts looks like this:
import PyPDF2
import pdfplumber
import pandas as pd
file = r'C:\Users\M\Desktop\0.pdf' # path to your own PDF
with pdfplumber.open(file) as pdf:
    for i in pdf.pages:
        for 表格 in i.extract_tables():
            数据 = pd.DataFrame(表格[1:],columns = 表格[0])
            数据.to_csv(r'C:\Users\M\Desktop\0.csv',mode = 'a',encoding = 'ANSI') # mode='a' appends the data to the end of the file
Result:
Traceback (most recent call last):
File "C:\Users\M\PycharmProjects\smooth\table.py", line 9, in <module>
数据.to_csv(r'C:\Users\M\Desktop\0.csv',mode = 'a',encoding = 'ANSI') # mode='a' appends the data to the end of the file
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\generic.py", line 3551, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\formats\format.py", line 1180, in to_csv
csv_formatter.save()
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\formats\csvs.py", line 261, in save
self._save()
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\formats\csvs.py", line 265, in _save
self._save_header()
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\formats\csvs.py", line 270, in _save_header
self.writer.writerow(self.encoded_labels)
File "C:\Users\M\AppData\Local\Programs\Python\Python310\lib\encodings\mbcs.py", line 25, in encode
return mbcs_encode(input, self.errors)[0]
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
Process finished with exit code 1
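A note on this error, hedged since the PDF itself is not shown: on Windows, `'ANSI'` is an alias for the `mbcs` codec, meaning the system's legacy code page. The traceback fails in `_save_header`, which suggests the table's header row contains at least one character that code page cannot represent. An encoding such as `'utf-8-sig'` can represent any Unicode character and is usually recognized by Excel. The sketch below uses the standard `csv` module instead of pandas to stay dependency-free; `sample_table` and `write_tables_to_csv` are hypothetical names, and the sample data only imitates the list-of-rows shape that `extract_tables()` returns.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for one table from pdfplumber's extract_tables():
# a list of rows, each row a list of cell strings (None for an empty cell).
sample_table = [
    ['名称', '数量'],
    ['螺栓 ✓', '12'],
    ['垫片', None],
]

def write_tables_to_csv(tables, csv_path):
    """Write every row of every table to one CSV file encoded as UTF-8 with a BOM."""
    # Opening once with mode 'w' avoids piling up duplicate rows on repeated runs.
    # newline='' is what the csv module expects for files it writes.
    with open(csv_path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        for table in tables:
            for row in table:
                writer.writerow(['' if cell is None else cell for cell in row])

# Demonstration in a temporary directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'tables.csv')
    write_tables_to_csv([sample_table], out_path)
    with open(out_path, encoding='utf-8-sig', newline='') as f:
        rows_read_back = list(csv.reader(f))

print(rows_read_back)
```

With pandas, the corresponding change would be passing `encoding='utf-8-sig'` to `to_csv` in place of `'ANSI'`.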
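This code did output a table, but it is not the same as the one in the original PDF.
All of this has worn me out a bit, so I came here. Thanks in advance for any pointers, much love~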
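Without seeing how the output differs it is hard to say what went wrong, but two things may be worth trying, sketched here under assumptions. First, pdfplumber's `extract_tables` accepts a `table_settings` dict, and switching the strategies between `'lines'` (ruled tables) and `'text'` (tables aligned only by whitespace) can change the result a lot. Second, extracted cells may be `None` or contain line breaks from wrapped text. The helper names `extract_page_tables` and `clean_table` are hypothetical.

```python
def extract_page_tables(pdf_path, page_number, strategy='lines'):
    """Return the tables found on one page (1-based), using the given strategy."""
    import pdfplumber  # imported here so the rest of the sketch runs without it

    settings = {'vertical_strategy': strategy, 'horizontal_strategy': strategy}
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a zero-based list
        return page.extract_tables(table_settings=settings)

def clean_table(table):
    """Replace empty cells with '' and turn line breaks inside cells into spaces."""
    return [
        ['' if cell is None else cell.replace('\n', ' ').strip() for cell in row]
        for row in table
    ]

# Small check of the cleaning step on made-up data.
cleaned = clean_table([['a\nb', None], [' c ', 'd']])
print(cleaned)
```

For the PDF in this post the call might look like `extract_page_tables(r'C:\Users\M\Desktop\0.pdf', 4, strategy='text')`, comparing the output against `strategy='lines'`.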