11/24/2021, Wednesday

Ghostscript Docker

The easiest way to install OCRmyPDF is to follow the steps for your operating system/platform. This version may be out of date, however.

These platforms have one-liner installs:

Platform                       Install command
Debian, Ubuntu                 apt install ocrmypdf
Windows Subsystem for Linux    apt install ocrmypdf
Fedora                         dnf install ocrmypdf
macOS                          brew install ocrmypdf
LinuxBrew                      brew install ocrmypdf
FreeBSD                        pkg install py38-ocrmypdf
Conda (WSL, macOS, Linux)      conda install ocrmypdf

More detailed procedures are outlined below. If you want to do a manual install, or install a more recent version than your platform provides, read on.

Platform-specific steps

OCRmyPDF versions in Debian & Ubuntu

Users of Debian 9 (“stretch”) or later, or Ubuntu 18.04 or later, including users of Windows Subsystem for Linux, may simply run apt install ocrmypdf.

As indicated in the table above, Debian and Ubuntu releases may lag behind the latest version. If the version available for your platform is out of date, you could opt to install the latest version from source. See Installing HEAD revision from sources. Ubuntu 16.10 to 17.10 inclusive also had ocrmypdf, but these versions are end of life.

For full details on version availability for your platform, check the Debian Package Tracker or Ubuntu launchpad.net.

Note

OCRmyPDF for Debian and Ubuntu currently omits the JBIG2 encoder. OCRmyPDF works fine without it but will produce larger output files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later will automatically detect it (specifically the jbig2 binary) on the PATH. To add JBIG2 encoding, see Installing the JBIG2 encoder.

OCRmyPDF version

Users of Fedora 29 or later may simply run dnf install ocrmypdf.

For full details on version availability, check the Fedora Package Tracker.

If the version available for your platform is out of date, you could opt to install the latest version from source. See Installing HEAD revision from sources.

Note

OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent issues. OCRmyPDF works fine without it but will produce larger output files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later will automatically detect it on the PATH. To add JBIG2 encoding, see Installing the JBIG2 encoder.

Ubuntu 20.04 includes ocrmypdf 9.6.0, which you can install with apt. To install a more recent version, uninstall the system-provided version of ocrmypdf, and install the following dependencies:

To install ocrmypdf for the system:

To install for the current user only:

Ubuntu 18.04 includes ocrmypdf 6.1.2, which you can install with apt, but it is quite old now. To install a more recent version, uninstall the old version of ocrmypdf, and install the following dependencies:

We will need a newer version of pip than was available for Ubuntu 18.04:

Then install the most recent ocrmypdf for the local user and set the user’s PATH to check for the user’s Python packages.

To add JBIG2 encoding, see Installing the JBIG2 encoder.

No package is available for Ubuntu 16.04. OCRmyPDF 8.0 and newer require Python 3.6. Ubuntu 16.04 ships Python 3.5, but you can install Python 3.6 on it. Or, you can skip Python 3.6 and install OCRmyPDF 7.x or older. For that procedure, please see the installation documentation for the version of OCRmyPDF you plan to use.

Install system packages for OCRmyPDF

This will install a Python 3.6 binary at /usr/bin/python3.6 alongside the system’s Python 3.5. Do not remove the system Python. This will also install Tesseract 4.0 from a PPA, since the version available in Ubuntu 16.04 is too old for OCRmyPDF.

Now install pip for Python 3.6. This will install the Python 3.6 version of pip at /usr/local/bin/pip.

Install OCRmyPDF

OCRmyPDF requires the locale to be set for UTF-8. On some minimal Ubuntu installations, such as the Ubuntu 16.04 Docker images, it may be necessary to set the locale.
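As a minimal sketch of the locale check described above, the following shell snippet tests whether the effective locale selects UTF-8 and, if not, sets one for the current session. The helper name is_utf8_locale is hypothetical, and the snippet assumes that the C.UTF-8 locale exists on the system; if it does not, substitute a locale you have generated, such as en_US.UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given locale name selects UTF-8.
is_utf8_locale() {
  case "$1" in
    *.[Uu][Tt][Ff]-8*|*.[Uu][Tt][Ff]8*) return 0 ;;
    *) return 1 ;;
  esac
}

# Effective locale: LC_ALL overrides LC_CTYPE, which overrides LANG.
current_locale=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}

if ! is_utf8_locale "$current_locale"; then
  # Assumption: the C.UTF-8 locale is available on this system.
  export LC_ALL=C.UTF-8
  export LANG=C.UTF-8
fi
```

The exports only affect the current shell session and its child processes; a Docker image would typically set the same variables in its environment instead.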

Now install OCRmyPDF for the current user, and ensure that the PATH environment variable contains $HOME/.local/bin.
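One way to make sure $HOME/.local/bin is on the PATH is sketched below. The helper name path_contains is hypothetical; the snippet only changes the PATH of the current shell session.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if directory $1 is an entry of the
# colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

user_bin="$HOME/.local/bin"

if ! path_contains "$user_bin" "$PATH"; then
  # Prepend for this session only.
  PATH="$user_bin:$PATH"
  export PATH
fi
```

To make the change permanent, the same assignment would normally go into a shell startup file such as ~/.profile.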

To add JBIG2 encoding, see Installing the JBIG2 encoder.

There is an Arch User Repository (AUR) package for OCRmyPDF.

Installing AUR packages as root is not allowed, so you must first set up a non-root user and configure sudo. The standard Docker image, archlinux/base:latest, does not have a non-root user configured, so users of that image must follow these guides. If you are using a VM image, such as the official Vagrant image, this work may already be completed for you.

Next you should install the base-devel package group. This includes the standard tooling needed to build packages, such as a compiler and binary tools.

Now you are ready to install the OCRmyPDF package.

At this point you will have a working install of OCRmyPDF, but the Tesseract install won’t include any OCR language data. You can install the tesseract-data package group to add all supported languages, or use that package listing to identify the appropriate package for your desired language.

As an alternative to this manual procedure, consider using an AUR helper. Such a tool will automatically fetch, build and install the AUR package, resolve dependencies (including dependencies on AUR packages), and ease the upgrade procedure.

If you have any difficulties with installation, check the repository package page.

Note

The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works fine without it but will produce larger output files. The encoder is available from the jbig2enc-git AUR package and may be installed using the same series of steps as for the installation of the OCRmyPDF AUR package. Alternatively, it may be built manually from source following the instructions in Installing the JBIG2 encoder. If JBIG2 is installed, OCRmyPDF 7.0.0 and later will automatically detect it.

To install OCRmyPDF for Alpine Linux:

There is no OS-level packaging available for Mageia, so you must install the dependencies:

To install ocrmypdf for the system:

Or, to install for the current user only:

See the Repology page.

In general, first install the OCRmyPDF package for your system, then optionally use the procedure Installing with Python pip to install a more recent version.

OCRmyPDF is now a standard Homebrew formula. To install on macOS:

This will include only the English language pack. If you need other languages you can optionally install them all:

Note

Users who previously installed OCRmyPDF on macOS using pip install ocrmypdf should remove the pip version (pip3 uninstall ocrmypdf) before switching to the Homebrew version.

Note

Users who previously installed OCRmyPDF from the private tap should switch to the mainline version (brew untap jbarlow83/ocrmypdf) and install from there.

These instructions probably work on all macOS versions supported by Homebrew, and are for installing a more current version of OCRmyPDF than is available from Homebrew. Note that the Homebrew versions usually track the release versions fairly closely.

If it’s not already present, install Homebrew.

Update Homebrew:

Install or upgrade the required Homebrew packages, if any are missing. To do this, use brew edit ocrmypdf to obtain a recent list of Homebrew dependencies. You could also check the .workflows/build.yml.

This will include the English, French, German and Spanish language packs. If you need other languages you can optionally install them all:

Update the homebrew pip:

You can then install OCRmyPDF from PyPI, for the current user:

or system-wide:

The command line program should now be available:

Note

Administrator privileges will be required for some of these steps.

You must install the following for Windows:

  • Python 3.7 (64-bit) or later

  • Tesseract 4.0 or later

  • Ghostscript 9.50 or later

Using the Chocolatey package manager, install the following when running in an Administrator command prompt:

  • choco install python3

  • choco install --pre tesseract

  • choco install ghostscript

  • choco install pngquant (optional)

The commands above will install Python 3.x (latest version), Tesseract, Ghostscript and pngquant. Chocolatey may also need to install the Windows Visual C++ Runtime DLLs or other Windows patches, and may require a reboot.

You may then use pip to install ocrmypdf. (This can be performed by a user or Administrator.):

  • pip install ocrmypdf

Chocolatey automatically selects appropriate versions of these applications. If you are installing them manually, please install 64-bit versions of all applications for 64-bit Windows, or 32-bit versions of all applications for 32-bit Windows. Mixing the “bitness” of these programs will lead to errors.

OCRmyPDF will check the Windows Registry and standard locations in your Program Files for third party software it needs (specifically, Tesseract and Ghostscript). To override the versions OCRmyPDF selects, you can modify the PATH environment variable. Follow these directions to change the PATH.

Warning

As of early 2021, users have reported problems with the Microsoft Store version of Python and OCRmyPDF. These issues affect many other third party Python packages. Please download Python from Python.org or Chocolatey instead, and do not use the Microsoft Store version.

  1. Install Ubuntu 18.04 for Windows Subsystem for Linux, if not already installed.

  2. Follow the procedure to install OCRmyPDF on Ubuntu 18.04.

  3. Open the Windows command prompt and create a symlink:

Then confirm that the expected version from PyPI is installed:

You can then run OCRmyPDF in the Windows command prompt or PowerShell, prefixing wsl, and call it from Windows programs or batch files.

First install the following prerequisite Cygwin packages using setup-x86_64.exe:

Note

The Cygwin package for Ghostscript in versions 9.52 and 9.52-1 contained a bug that caused an exception to occur when ocrmypdf invoked gs. Make sure you have either 9.50 (or earlier) or 9.52-2 (or later).

Then open a Cygwin terminal (i.e. mintty) and run the following commands. Note that if you are using the version of pip that was installed with the Cygwin Python package, the command name will be pip3. If you have since updated pip (with, for instance, pip3 install --upgrade pip), then the command is likely just pip instead of pip3:

The optional dependency “unpaper” is currently not available under Cygwin. Without it, certain options such as --clean will produce an error message. However, the OCR-to-text-layer functionality is available.

You can also install the Docker container on Windows. Ensure that your command prompt can run the docker “hello world” container.

FreeBSD 11.3, 12.0, 12.1-RELEASE and 13.0-CURRENT are supported. Other versions likely work but have not been tested.

To install a more recent version, you could attempt to first install the system version with pkg, then use pip install --user ocrmypdf.

For some users, installing the Docker image will be easier than installing all of OCRmyPDF’s dependencies.

See OCRmyPDF Docker image for more information.

OCRmyPDF is delivered by PyPI because it is a convenient way to install the latest version. However, PyPI and pip cannot address the fact that ocrmypdf depends on certain non-Python system libraries and programs being installed.

Warning

Debian and Ubuntu users: unfortunately, Debian and Ubuntu customize Python in non-standard ways, and the nature of these customizations varies from release to release. This can make for a frustrating user experience. The instructions below work on almost all platforms that have Python installed, except for Debian and Ubuntu, where you may need to take additional steps. For best results on Debian and Ubuntu, use the apt packages; or if these are too old, run apt install python3-pip python3-venv, create a virtual environment, and install OCRmyPDF in that environment.

See here for more information on Debian-Python issues.
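The virtual environment route mentioned in the warning can be sketched as follows. This is an illustration rather than the project's official procedure: the run wrapper, the function name install_ocrmypdf_in_venv, and the default directory $HOME/.venvs/ocrmypdf are all hypothetical choices, and the sketch assumes python3 with the venv module is already installed. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical wrapper: print commands instead of running them when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create a virtual environment and install OCRmyPDF into it.
# The default location is an arbitrary choice for this sketch.
install_ocrmypdf_in_venv() {
  venv_dir=${1:-"$HOME/.venvs/ocrmypdf"}
  run python3 -m venv "$venv_dir" || return 1
  run "$venv_dir/bin/pip" install --upgrade pip || return 1
  run "$venv_dir/bin/pip" install ocrmypdf || return 1
}

# Usage (not executed here):
#   install_ocrmypdf_in_venv "$HOME/.venvs/ocrmypdf"
#   "$HOME/.venvs/ocrmypdf/bin/ocrmypdf" --version
```

Calling pip through the environment's own bin directory avoids having to activate the environment first, and keeps the installation separate from the system Python.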

For best results, first install your platform’s version of ocrmypdf, using the instructions elsewhere in this document. Then you can use pip to get the latest version if your platform version is out of date. Chances are that this will satisfy most dependencies.

Use ocrmypdf --version to confirm what version was installed.

Then you can install the latest OCRmyPDF from the Python wheels. First try:

You should then be able to run ocrmypdf --version and see that the latest version was located.

Since pip3 install --user does not work correctly on some platforms, notably Ubuntu 16.04 and older, and the Homebrew version of Python, instead use this for a system wide installation:

Note

AArch64 (ARM64) users: this process will be difficult because most Python packages are not available as binary wheels for your platform. You’re probably better off using a platform install on Debian, Ubuntu, or Fedora.

OCRmyPDF currently requires these external programs and libraries to be installed. They must be provided by the operating system package manager; pip cannot provide them.

  • Python 3.6 or newer

  • Ghostscript 9.15 or newer

  • qpdf 8.1.0 or newer

  • Tesseract 4.0.0-beta or newer
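A small sketch of how an installed version could be compared against the minimums listed above. The helper names version_ge and check_min are hypothetical, and the comparison assumes a sort that supports -V (version sort), as GNU coreutils does. Pre-release strings such as 4.0.0-beta are not handled.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
# Assumption: sort supports -V (version sort), as GNU coreutils does.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether a detected version meets a minimum.
check_min() {
  name=$1
  found=$2
  minimum=$3
  if version_ge "$found" "$minimum"; then
    echo "$name $found: ok (needs $minimum or newer)"
  else
    echo "$name $found: too old (needs $minimum or newer)"
    return 1
  fi
}

# Usage (not executed here), assuming gs prints a bare version number:
#   check_min Ghostscript "$(gs --version)" 9.15
```

Version sort compares each numeric component as a number, so 9.5 is treated as older than 9.15, which a plain string comparison would get wrong.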

As of ocrmypdf 7.2.1, the following versions are recommended:

  • Python 3.7 or 3.8

  • Ghostscript 9.23 or newer

  • qpdf 8.2.1

  • Tesseract 4.0.0 or newer

  • jbig2enc 0.29 or newer

  • pngquant 2.5 or newer

  • unpaper 6.1

jbig2enc, pngquant, and unpaper are optional. If they are missing, certain features are disabled. OCRmyPDF will discover them as soon as they are available.

jbig2enc, if present, will be used to optimize the encoding of monochrome images. This can significantly reduce the file size of the output file. It is not required. jbig2enc is not generally available for Ubuntu or Debian due to lingering concerns about patent issues, but can easily be built from source. To add JBIG2 encoding, see Installing the JBIG2 encoder.

pngquant, if present, is optionally used to optimize the encoding of PNG-style images in PDFs (actually, any that are losslessly encoded) by lossily quantizing to a smaller color palette. It is only activated when the --optimize argument is 2 or 3.

unpaper, if present, enables the --clean and --clean-final command line options.
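To see which of these optional programs are currently visible on the PATH, a quick check along the following lines may help. This is an independent sketch, not the detection logic OCRmyPDF itself uses; the helper name has_tool is hypothetical, and jbig2 is the binary name mentioned earlier for jbig2enc.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an executable named $1 is found on the PATH.
has_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Optional programs described above. "jbig2" is the binary built by jbig2enc.
for tool in jbig2 pngquant unpaper; do
  if has_tool "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found (related features will be disabled)"
  fi
done
```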

These are in addition to the Python packaging dependencies, meaning that unfortunately, the pip install command cannot satisfy all of them.

If you have git and Python 3.6 or newer installed, you can install from source. When the pip installer runs, it will alert you if dependencies are missing.

If you prefer to build everything from source, you will need to build pikepdf from source. First ensure you can build and install pikepdf.

To install the HEAD revision from sources in the current Python 3 environment:

Or, to install in development mode, allowing customization of OCRmyPDF, use the -e flag:

You may find it easiest to install in a virtual environment, rather than system-wide:

However, ocrmypdf will only be accessible on the system PATH when you activate the virtual environment.

To run the program:

If not yet installed, the script will notify you about dependencies that need to be installed. The script requires specific versions of the dependencies. Versions older than the ones mentioned in the release notes are likely not compatible with OCRmyPDF.

To install all of the development and test requirements:

To add JBIG2 encoding, see Installing the JBIG2 encoder.

Completions for bash and fish are available in the project’s misc/completion folder. The bash completions are likely zsh compatible but this has not been confirmed. Package maintainers, please install these at the appropriate locations for your system.

To manually install the bash completion, copy misc/completion/ocrmypdf.bash to /etc/bash_completion.d/ocrmypdf (rename the file).

To manually install the fish completion, copy misc/completion/ocrmypdf.fish to ~/.config/fish/completions/ocrmypdf.fish.
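The two copy steps above could be scripted roughly as follows. The function names are hypothetical, and the first argument is assumed to be the root of an OCRmyPDF source checkout; the destination directories default to the locations named above but can be overridden.

```shell
#!/bin/sh
# Hypothetical helper: copy the fish completion from a source checkout ($1)
# into a completions directory ($2, defaulting to the per-user location).
install_fish_completion() {
  src_root=$1
  dest_dir=${2:-"$HOME/.config/fish/completions"}
  mkdir -p "$dest_dir" || return 1
  cp "$src_root/misc/completion/ocrmypdf.fish" "$dest_dir/ocrmypdf.fish"
}

# Hypothetical helper: copy the bash completion, dropping the .bash suffix.
# Writing to the default system directory usually requires root.
install_bash_completion() {
  src_root=$1
  dest_dir=${2:-/etc/bash_completion.d}
  cp "$src_root/misc/completion/ocrmypdf.bash" "$dest_dir/ocrmypdf"
}

# Usage (not executed here), from inside the checkout:
#   install_fish_completion .
#   sudo sh -c '. ./this-script.sh; install_bash_completion .'
```

The script name in the last usage line is a placeholder for wherever these functions are saved.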

Ghostscript Docker Pdf

Ghostscript is an interpreter for the PostScript® language and PDF files. It is available under either the GNU Affero General Public License (AGPL) or licensed for commercial use from Artifex Software, Inc. It has been under active development for over 30 years and has been ported to several different systems during this time. Ghostscript consists of a PostScript interpreter layer and a graphics library.

There is a family of other products, including GhostPCL, GhostPDF, and GhostXPS, that are built upon the same graphics library. Between them, this family of products offers native rendering of all major page description languages. Our latest product, GhostPDL, pulls all these languages into a single executable.

Full descriptions of these products can be found here.

In addition to rendering to raster formats, Ghostscript offers high-level conversion through our vector output devices.

Written entirely in C, Ghostscript runs on various embedded operating systems and platforms including Windows, macOS, the wide variety of Unix and Unix-like platforms, and VMS systems.

Current Release

The current Ghostscript release 9.55.0 can be downloaded here.

NEW in this Release

  • New PDF Interpreter: See Changes Coming to the PDF Interpreter
  • JPXPassthrough with pdfwrite: That means that if no rescaling or color conversion of the image data is required, the encoded/compressed image data from the input file will be written unchanged to the output, preventing potential image degradation caused by decompressing and recompressing.
  • And more! Review the full release notes here.

The Ghostscript Blog

Here you will find news, articles and developer notes from the Ghostscript engineering team. Find it here.

Security Advisory

September 9, 2021: CVE-2021-3781 - Learn more...

Developers

  • Ghostscript has a Discord channel #ghostscript and an IRC channel on irc.libera.chat. We bridge the IRC and Discord channels for convenience.

If you want to contribute patches to Ghostscript or GhostPDL you will need to read, understand and sign the Artifex Contributor License Agreement. We also have a bug bounty program if you're looking for a place to start contributing.

Related projects

A JBIG2 image decoder:
