February 8, 2022 PDF

Reduce PDF size without sacrificing visual quality or metadata

DietPDF aims to reduce PDF file size without degrading quality or losing metadata. Here are some tricks used to achieve this goal: use Zopfli instead of Zlib to get a better compression ratio while remaining compatible with Zlib; use JpegTran to optimize embedded JPEGs and strip unnecessary data from them; use Run-Length Encoding to help Zopfli achieve better compression; use Zopfli on embedded JPEGs, which sometimes helps; remove unnecessary spaces […]
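As a rough illustration of the run-length step, here is a minimal sketch of the byte-oriented scheme used by the PDF RunLengthDecode filter, followed by a Deflate pass. It is not DietPDF's code: the function names are hypothetical, and Python's standard zlib stands in for Zopfli, which is not in the standard library.

```python
import zlib


def rle_encode(data: bytes) -> bytes:
    """Encode bytes in the PDF RunLengthDecode format (hypothetical helper)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (at most 128).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            # Length byte 257 - run means "repeat the next byte run times".
            out += bytes((257 - run, data[i]))
            i += run
        else:
            # Collect literal bytes until a repeat starts or 128 are gathered.
            start = i
            i += 1
            while i < n and i - start < 128 and not (
                i + 1 < n and data[i] == data[i + 1]
            ):
                i += 1
            # Length byte k (0..127) means "copy the next k + 1 bytes".
            out.append(i - start - 1)
            out += data[start:i]
    out.append(128)  # end-of-data marker
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        length = encoded[i]
        if length == 128:
            break
        if length < 128:
            out += encoded[i + 1 : i + 2 + length]
            i += 2 + length
        else:
            out += encoded[i + 1 : i + 2] * (257 - length)
            i += 2
    return bytes(out)


def compress_stream(data: bytes) -> bytes:
    """Run-length encode, then Deflate (zlib used here in place of Zopfli)."""
    return zlib.compress(rle_encode(data), 9)
```

Whether the run-length pass actually shrinks the final output depends on the data, so a tool would typically compare the result against plain Deflate and keep the smaller one.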

November 14, 2021 PDF

pdf_sprinkles: sprinkles text in your PDFs

pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI and returns the result as a PDF with searchable text. It runs on the command line or as a web server. The server version can be deployed to App Engine easily. pdf_sprinkles has only been tested with English-language text, but it should work for most European languages currently supported by the Document AI API. It is currently known not to work with RTL languages or CJK scripts. Installation pdf_sprinkles is […]

November 14, 2021 PDF

Extract tables from a PDF and output the data in a JSON-like format

While developing an RPA project, I needed to extract table content from PDFs while preserving the table format. After searching online for many days, I could not find an open source library that fully met the project's needs. In the end, I used a pymupdf + cv2 framework to extract the tables. pymupdf reads the PDF content (pymupdf also supports XPS files), while cv2 draws the lines found in the extracted content and computes the table outlines, which finally gives the correspondence between the text content and the table cells. The project is rather niche and the code is scattered, but I hope it helps anyone who happens to need it. […]
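The last step described above, matching text to table cells, can be sketched without either library. This is a minimal illustration and not the project's code: `build_table` and its inputs are hypothetical, and it assumes the ruling-line coordinates (from the cv2 contour step) and the word positions (from pymupdf) have already been extracted.

```python
from bisect import bisect_right


def build_table(x_lines, y_lines, words):
    """Place words into a grid defined by ruling-line coordinates.

    x_lines / y_lines: sorted coordinates of vertical / horizontal lines.
    words: iterable of (x, y, text), where (x, y) is the word's centre point.
    """
    n_rows, n_cols = len(y_lines) - 1, len(x_lines) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in words:
        # The cell index is the number of grid lines at or before the point.
        col = bisect_right(x_lines, x) - 1
        row = bisect_right(y_lines, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            table[row][col] = f"{table[row][col]} {text}".strip()
    return table
```

The resulting list of rows can be passed to `json.dumps` to get JSON-style output; words that fall outside the grid are simply ignored.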

November 9, 2021 PDF

Searching keywords in PDF file folders

Steps to use this Python script: (1) Paste the script into the folder containing the PDF files you want to search. (2) The script is based on an Anaconda environment and requires the Python package PDFMiner. (3) Run the script and enter a keyword. (4) You can now locate the keyword down to the specific line and passage.
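The steps above can be sketched roughly as follows. This is not the script itself: the function names are hypothetical, and the default text extractor assumes the pdfminer.six distribution, whose `pdfminer.high_level.extract_text` function takes a file path and returns the document text.

```python
from pathlib import Path


def find_keyword(text, keyword):
    """Return (line_number, line) pairs for lines containing the keyword."""
    needle = keyword.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


def search_folder(folder, keyword, extract=None):
    """Search every PDF in a folder; returns {filename: [(line_no, line), ...]}."""
    if extract is None:
        # Assumption: pdfminer.six is installed.
        from pdfminer.high_level import extract_text as extract
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        hits = find_keyword(extract(str(pdf)), keyword)
        if hits:
            results[pdf.name] = hits
    return results


if __name__ == "__main__":
    keyword = input("Keyword: ")
    for name, hits in search_folder(".", keyword).items():
        for number, line in hits:
            print(f"{name}:{number}: {line}")
```

Line numbers here refer to lines of the extracted text, which may not match the visual line layout of the PDF.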

October 30, 2021 PDF

Fuzzing PDFs like it's the 1990s

This is the fuzzer I wrote to fuzz Preview on macOS and iOS about 8 years ago, when I had just started fuzzing things. Some disclosed vulnerabilities: CVE-2015-3723, CVE-2016-1737, CVE-2016-1740, CVE-2017-7031. The basic idea of this fuzzer was to mutate the streams of the PDF files without breaking the PDF structure as a whole. I collected a few hundred PDFs and converted them to Python scripts using the -g flag of Didier Stevens's pdf-parser. The fuzzer uses cPDF that I modified to mutate […]
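To make the idea of "mutating streams without breaking the structure" concrete, here is a minimal sketch. It is not the author's fuzzer and does not use pdf-parser or cPDF: the hypothetical `mutate_streams` helper simply overwrites bytes in place between the `stream` and `endstream` keywords, so the file length, the /Length entries, and the cross-reference offsets all stay valid.

```python
import random
import re

# Matches the payload between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def mutate_streams(pdf_bytes, rate=0.01, seed=None):
    """Randomly overwrite bytes inside stream payloads only.

    rate: probability that any given payload byte is replaced.
    seed: optional seed so a crashing test case can be regenerated.
    """
    rng = random.Random(seed)
    out = bytearray(pdf_bytes)
    for match in STREAM_RE.finditer(pdf_bytes):
        start, end = match.span(1)
        for i in range(start, end):
            if rng.random() < rate:
                out[i] = rng.randrange(256)
    return bytes(out)
```

Because compressed streams are mutated as raw bytes, many mutations will only exercise the decompressor; mutating the decoded payload and re-encoding it would reach deeper parsing code, at the cost of having to rewrite lengths and offsets.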

October 1, 2021 PDF

Distfit: Probability density fitting

Star it if you like it! Background distfit is a Python package for probability density fitting across 89 univariate distributions to non-censored data by residual sum of squares (RSS), and hypothesis testing. Probability density fitting is the fitting of a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. distfit scores each of the 89 different distributions for the fit with the empirical distribution and returns the best-scoring distribution. Functionalities The distfit library is […]
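The RSS scoring idea can be illustrated with the standard library alone. The sketch below does not use distfit or its API: it compares a histogram of the data against just three hand-picked candidate densities, and all function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev


def histogram_density(data, bins=20):
    """Return bin centres and the empirical density in each bin.

    Assumes the data are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def rss(pdf, centers, density):
    """Residual sum of squares between the empirical and candidate density."""
    return sum((d - pdf(c)) ** 2 for c, d in zip(centers, density))


def best_fit(data, bins=20):
    """Score each candidate by RSS and return (best_name, all_scores)."""
    mu, sigma = fmean(data), pstdev(data)
    lo, hi = min(data), max(data)
    scale = mu - lo  # exponential scale, with the location fixed at the minimum
    candidates = {
        "norm": NormalDist(mu, sigma).pdf,
        "uniform": lambda x: 1.0 / (hi - lo) if lo <= x <= hi else 0.0,
        "expon": lambda x: math.exp(-(x - lo) / scale) / scale if x >= lo else 0.0,
    }
    centers, density = histogram_density(data, bins)
    scores = {name: rss(pdf, centers, density) for name, pdf in candidates.items()}
    return min(scores, key=scores.get), scores
```

A lower RSS means the candidate density follows the histogram more closely; the result depends on the number of bins, which is a free choice in this sketch.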

September 12, 2021 PDF

A Python library for rendering reMarkable documents to PDF files

rmrl is a Python library for rendering reMarkable documents to PDF files. It takes the original PDF document and the files describing your annotations, combining them to produce a document close to what reMarkable itself would output. Demo The same notebook was rendered to a PDF via the reMarkable app and rmrl. The resultant PDF files were converted to PNGs with ImageMagick at 300 dpi. [Images: reMarkable output; rmrl output] The biggest differences are the lack of texture in the pencils […]

July 20, 2021 PDF

A Python tool to generate a static HTML file that represents the internal structure of a PDF file

At some point, the low-level functions developed for this CLI will be exposed as an API for programmatic use. WORK IN PROGRESS! CLI Features The generated HTML looks like the raw PDF file with the following additions: pretty-printed dictionary objects; objects contained in an object stream are extracted and inserted in the flow like regular objects; streams are decompressed and displayed […]
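As an illustration of the "decompress a stream and show it in HTML" feature, here is a minimal sketch. It is not this tool's code: `stream_to_html` is a hypothetical helper, and it assumes the stream uses the FlateDecode filter with no predictor, which is the common zlib/Deflate case.

```python
import html
import zlib


def stream_to_html(raw_stream: bytes) -> str:
    """Inflate a FlateDecode stream and wrap it in an HTML <pre> block."""
    data = zlib.decompress(raw_stream)
    # latin-1 maps every byte to a character, so decoding never fails.
    text = data.decode("latin-1")
    # Escape <, > and & so the content is displayed rather than interpreted.
    return "<pre>" + html.escape(text) + "</pre>"
```

Streams that use other filters (for example DCTDecode for JPEG images) would need their own handling and are out of scope for this sketch.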

July 14, 2021 PDF

Python utility library for compositing PDF documents with reportlab

pdfdoc-py: Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the source code: $ git clone https://github.com/michaelgale/pdfdoc-py.git $ cd pdfdoc-py $ python setup.py install Usage After installation, the package can be imported: $ python >>> import pdfdoc >>> pdfdoc.__version__ Example of making a label sheet with 25 labels on Avery 5262 self-adhesive label sheets: from pdfdoc import * ld = LabelDoc("my_labels.pdf", style=AVERY_5262_LABEL_DOC_STYLE) labels = [i for i in range(25)] for label, row, […]

July 10, 2021 PDF

Stitch PDF files together with bookmarks, written in pure Python using PyPDF3

pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a markdown file. It is written in pure Python and uses PyPDF3 for reading and writing PDF files. pystitcher is a command-line tool with very few CLI options: usage: pystitcher [-h] [--version] [-v] [--cleanup | --no-cleanup] spine.md output.pdf Stitch PDF files together positional arguments: spine.md Input markdown file output.pdf Output PDF file optional arguments: -h, --help show this […]