Extract tables from a PDF and output the data in a JSON-like format

While developing an RPA project, I needed to extract table content from PDF files while preserving the table structure. After several days of searching online, I could not find an open-source library that fully met the project's needs. In the end I used PyMuPDF together with cv2 (OpenCV) to extract the tables. PyMuPDF reads the PDF content (PyMuPDF also supports XPS files), cv2 draws the lines found in the extracted content and computes the table contours, and finally the text is matched to the table cells it belongs to. The project is fairly niche and the code is rather scattered, but I hope it helps anyone who happens to need it.
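The last step described above, matching text to table cells and emitting JSON-like data, can be sketched as follows. This is a minimal illustration and not the project's actual code. It assumes cell boxes are already available as `(x0, y0, x1, y1)` tuples and that words are tuples whose first five fields are `(x0, y0, x1, y1, text)`, which is the layout PyMuPDF's `page.get_text("words")` returns. The function names `group_cells_into_rows` and `words_to_table` are hypothetical.

```python
import json


def group_cells_into_rows(cells, tol=3.0):
    """Group cell boxes (x0, y0, x1, y1) into rows by their top edge."""
    rows = []
    for box in sorted(cells, key=lambda b: (b[1], b[0])):
        # A box joins the current row if its top edge is within `tol`
        # of the first box in that row; otherwise it starts a new row.
        if rows and abs(rows[-1][0][1] - box[1]) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def words_to_table(cells, words, tol=3.0):
    """Assign each word to the cell that contains its centre point."""
    table = []
    for row in group_cells_into_rows(cells, tol):
        out_row = []
        for x0, y0, x1, y1 in row:
            # Words keep the order in which they were supplied.
            inside = [
                w[4] for w in words
                if x0 <= (w[0] + w[2]) / 2 <= x1 and y0 <= (w[1] + w[3]) / 2 <= y1
            ]
            out_row.append(" ".join(inside))
        table.append(out_row)
    return table


# Synthetic 2x2 table (assumed data, for illustration only).
cells = [(0, 0, 50, 20), (50, 0, 100, 20), (0, 20, 50, 40), (50, 20, 100, 40)]
words = [
    (5, 5, 20, 15, "Name"), (55, 5, 70, 15, "Qty"),
    (5, 25, 20, 35, "Apple"), (22, 25, 40, 35, "pie"), (55, 25, 60, 35, "3"),
]
table = words_to_table(cells, words)
print(json.dumps(table, ensure_ascii=False))
# [["Name", "Qty"], ["Apple pie", "3"]]
```

Using the word's centre point rather than its full box tolerates text that slightly overlaps a ruling line. A word whose centre lies exactly on a shared edge would land in both neighbouring cells, so a real implementation may want a stricter rule.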

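Libraries tried in the project:

  1. tabula-py: the underlying implementation is in Java (see tabula-java). Its PDF table extraction is powerful, but it occasionally raised exceptions while the project was running.
  2. pdfplumber: very convenient to use, but the tables in some PDFs could not be extracted.
  3. camelot: because of my own limited skill, I ran into problems during the pip installation and could not get it installed.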

python3
PyMuPDF==1.19.1
cv2==4.5.4 (OpenCV; the pip package is opencv-python)

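Because the existing open-source projects could not meet the project's constraints, I decided to use a computer-vision approach to extract the table information. The rough processing flow is as follows: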

 

 

 
