Introduction
PDF files are one of the most widely used document formats in the world, but they can quickly grow to unwieldy sizes. Whether you are sharing reports via email, uploading documents to a web portal, or archiving years of records, large PDF files create real problems. Understanding how PDF compression works gives you the knowledge to reduce PDF size effectively without sacrificing the quality your readers expect.
In this guide we break down the internal structure of a PDF, explain the compression algorithms that operate on different data streams, and help you decide between lossy and lossless approaches. By the end you will know exactly what happens when you compress a PDF online and how to choose the right settings for every situation.
Inside a PDF: Objects, Streams, and Cross-Reference Tables
A PDF file is not a single blob of data. It is a structured collection of objects that describe pages, fonts, images, annotations, and metadata. Each object can contain a stream, which is the raw binary payload, for example the pixel data of an embedded photograph or the outlines of a vector graphic.
The cross-reference table at the end of the file acts as an index, telling a PDF reader exactly where each object starts. When a PDF is updated incrementally, new objects and a new cross-reference section are appended rather than rewriting the whole file. Over time this leads to bloat, duplicate objects, and orphaned data that inflates the file size with no visible benefit.
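The lookup a reader performs can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: real readers also handle cross-reference streams and chained `/Prev` tables from incremental updates.

```python
def find_startxref(pdf: bytes) -> int:
    """A reader scans backwards from the end of the file for the
    'startxref' keyword; the number after it is the byte offset of
    the most recent cross-reference section."""
    idx = pdf.rfind(b"startxref")
    return int(pdf[idx + len(b"startxref"):].split()[0])

tail = b"...objects...\nstartxref\n1234\n%%EOF\n"
print(find_startxref(tail))  # 1234
```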
Effective PDF compression therefore involves two broad strategies: compressing individual streams more aggressively, and restructuring the file to remove redundancy at the object level.
Lossless vs Lossy Compression
Before diving into specific algorithms, it is important to understand the fundamental distinction between lossless and lossy compression, because the choice affects every decision you make when optimizing a PDF.
Lossless Compression
Lossless compression reduces file size without discarding any data. When you decompress the file you get back the exact original, bit for bit. The most common lossless algorithm used inside PDFs is Flate (Deflate/zlib), the same algorithm behind PNG images and ZIP archives. Other lossless filters available in the PDF specification include:
- LZW — an older dictionary-based algorithm, less efficient than Flate but still supported for backward compatibility.
- RunLength — encodes sequences of identical bytes, useful for simple graphics with large areas of flat color.
- ASCII85 and ASCIIHex — not true compression but encoding filters that convert binary data to printable ASCII, growing it by roughly 25% and 100% respectively; they are sometimes stacked on top of Flate for transport safety.
Lossless compression is ideal for text-heavy documents, technical drawings, and any file where absolute fidelity is required. Typical size reduction ranges from 20% to 60% depending on content.
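Both behaviors are easy to demonstrate with Python's standard library, since zlib implements the same Deflate algorithm as /FlateDecode and base64 provides ASCII85. A rough sketch of a content stream passing through a stacked filter chain:

```python
import base64
import zlib

# A repetitive text content stream, like PDF page-drawing operators.
stream = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 200

flate = zlib.compress(stream, 9)       # /FlateDecode: lossless
ascii85 = base64.a85encode(flate)      # /ASCII85Decode stacked on top

# Decoding the chain in reverse restores the exact original bytes.
assert zlib.decompress(base64.a85decode(ascii85)) == stream
print(len(stream), len(flate), len(ascii85))  # ASCII85 adds ~25% back
```

Note how the ASCII85 layer makes the stream larger again: it exists for transport safety, not compression.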
Lossy Compression
Lossy compression achieves much greater size reduction by permanently removing data that is deemed less perceptible to the human eye. In PDFs this almost exclusively targets embedded raster images, which are often the single largest contributor to file size.
The primary lossy technique is JPEG (DCT) recompression. By lowering the JPEG quality factor of embedded photographs, a compressor can shrink an image stream dramatically. A quality setting of 60-75 typically produces visually acceptable results for on-screen viewing, while settings below 50 introduce noticeable artifacts around edges and text overlays.
Another lossy strategy is color space conversion. Converting images from CMYK to RGB, or from RGB to grayscale, removes an entire channel of data. This is appropriate when the document will only be viewed on screen and never sent to a professional print workflow.
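The grayscale conversion itself is simple arithmetic. A minimal sketch using the common ITU-R BT.601 luma weights (one of several possible weightings): each 3-byte RGB sample collapses to a single luma byte, cutting the raw sample data to one third before any stream compression is applied.

```python
def rgb_to_gray(pixels: bytes) -> bytes:
    """Collapse 3-byte RGB samples to 1-byte luma (BT.601 weights).
    Dropping two of three channels is irreversible: this is lossy."""
    return bytes(
        round(0.299 * r + 0.587 * g + 0.114 * b)
        for r, g, b in zip(pixels[0::3], pixels[1::3], pixels[2::3])
    )

rgb = bytes([255, 0, 0,  0, 255, 0,  0, 0, 255])  # red, green, blue
gray = rgb_to_gray(rgb)
print(list(gray))               # [76, 150, 29]
assert len(gray) == len(rgb) // 3
```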
Stream Compression in Detail
Every content stream inside a PDF can carry its own compression filter. When you open a raw PDF in a text editor you will see entries like /Filter /FlateDecode on stream objects. The PDF specification allows filters to be chained, so a single stream might first be run through a predictor filter and then through Flate for better compression ratios.
Flate (Deflate)
Flate is the workhorse of modern PDF compression. It combines LZ77 dictionary matching with Huffman coding to achieve strong compression on text streams, vector drawing commands, and metadata. Most PDF creation tools already apply Flate, but at a moderate compression level to keep generation fast. A dedicated optimizer can re-compress these streams at the maximum level, often shaving another 10-15% off the file.
Predictor Filters
For image data stored as raw samples, applying a predictor filter before Flate significantly improves compression. The PNG Sub predictor, for example, stores each sample as the difference from the sample to its left. Because neighboring pixels in a photograph tend to be similar, the differences cluster around zero, giving Flate far more repeating patterns to exploit.
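A toy demonstration of the Sub predictor: on a smooth gradient, the raw bytes are all different, but the left-neighbor deltas are almost all zero, and Flate compresses the predicted data far better.

```python
import zlib

# A smooth horizontal gradient: each value repeats 4 times, 0..255.
row = bytes(i // 4 for i in range(1024))

def sub_predict(samples: bytes) -> bytes:
    """PNG 'Sub' predictor: store each byte as the delta from its
    left neighbor (modulo 256), with an implicit 0 before the row."""
    return bytes((samples[i] - (samples[i - 1] if i else 0)) & 0xFF
                 for i in range(len(samples)))

plain = zlib.compress(row, 9)
predicted = zlib.compress(sub_predict(row), 9)
print(len(plain), "vs", len(predicted))  # deltas cluster near zero
assert len(predicted) < len(plain)
```

The decoder simply reverses the subtraction after inflating, so the chain stays fully lossless.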
JBIG2 for Scanned Text
Scanned documents are a special case. JBIG2 is a compression standard specifically designed for bitonal (black-and-white) images such as scanned pages of text. It works by identifying repeated character shapes on a page, storing each unique shape once, and then referencing it wherever it appears. JBIG2 can reduce a scanned page to a fraction of the size achieved by Flate alone, making it invaluable for archiving large volumes of scanned paperwork.
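The core idea, stripped of JBIG2's actual bit-level coding, is a symbol dictionary. This toy sketch (the byte strings stand in for bitonal glyph bitmaps) stores each unique shape once and replaces repeats with an index:

```python
# Each entry stands in for the bitmap of one character on a scanned page.
page_glyphs = [b"0110/1001", b"1111/1001", b"0110/1001", b"0110/1001"]

symbols: list[bytes] = []   # the shared symbol dictionary
refs: list[int] = []        # per-occurrence references into it
for bitmap in page_glyphs:
    if bitmap not in symbols:
        symbols.append(bitmap)
    refs.append(symbols.index(bitmap))

print(len(symbols))  # 2 unique shapes stored
print(refs)          # [0, 1, 0, 0]
```

A page full of repeated letterforms thus shrinks to one small dictionary plus a list of positions and indices, which is why JBIG2 beats generic Flate so decisively on scanned text.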
Image Downsampling
Embedded images are frequently stored at resolutions far higher than necessary. A photograph captured at 300 DPI and placed in a PDF destined for on-screen reading does not need that resolution. Downsampling reduces the pixel dimensions of an image, which in turn reduces the amount of data that needs to be compressed.
Common downsampling targets:
- 72-96 DPI — suitable for screen-only documents, email attachments, and web downloads.
- 150 DPI — a good balance for documents that may be printed on a desktop printer.
- 300 DPI — required for professional print production; downsampling below this is not recommended for press-quality output.
Downsampling algorithms also matter. Bicubic interpolation produces the smoothest results when reducing resolution, while average downsampling is faster but can look slightly softer. A well-tuned PDF compressor lets you choose the target DPI and algorithm to match your use case.
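The arithmetic behind choosing new pixel dimensions is straightforward once you know that PDF page coordinates are measured in points, 72 per inch. A rough sketch (the function and its names are illustrative, not any particular tool's API):

```python
def downsample_target(px_width: int, px_height: int,
                      placed_width_pt: float, target_dpi: int):
    """Compute new pixel dimensions for an image placed on a page.
    Effective DPI = pixel width / (placed width in points / 72)."""
    effective_dpi = px_width / (placed_width_pt / 72)
    if effective_dpi <= target_dpi:      # already low enough: never upsample
        return px_width, px_height
    scale = target_dpi / effective_dpi
    return round(px_width * scale), round(px_height * scale)

# A 2480x3508 scan placed full-width on A4 (~595 pt) is ~300 DPI;
# retargeting it to 96 DPI for screen use shrinks it dramatically.
print(downsample_target(2480, 3508, 595, 96))   # (793, 1122)
```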
Font Subsetting and Embedding
Fonts can contribute significantly to PDF file size, especially when multiple typefaces are fully embedded. A full font file for a professional typeface may exceed 500 KB, and a document using four font styles (regular, bold, italic, bold-italic) could carry over 2 MB of font data alone.
Font subsetting solves this by embedding only the glyphs actually used in the document. If your PDF contains English text that uses 60 unique characters out of a font with 2,000 glyphs, subsetting removes the other 1,940 glyphs and their associated hinting data. The savings can be enormous, particularly for CJK (Chinese, Japanese, Korean) fonts that contain tens of thousands of characters.
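The first step of subsetting is just collecting the distinct characters the document actually draws. A minimal sketch, with the 2,000-glyph font size assumed from the example above:

```python
def glyphs_needed(text: str) -> set[str]:
    """A subset embedder keeps one outline per distinct character used."""
    return set(text)

sample = "The quick brown fox jumps over the lazy dog." * 1000
used = glyphs_needed(sample)
total_glyphs = 2000   # assumed size of the full font, per the example above

print(f"{len(used)} of {total_glyphs} glyphs needed "
      f"({100 * len(used) / total_glyphs:.1f}%)")
```

Repeating the text a thousand times changes nothing: only the set of distinct characters matters, which is why even long English documents typically need well under a hundred glyphs.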
Some optimizers go further by converting TrueType outlines to CFF (Compact Font Format), which represents curves more efficiently. This is a lossless transformation that can reduce font stream sizes by 20-30%.
Object-Level Optimization
Beyond compressing individual streams, a PDF optimizer can restructure the file at the object level:
- Removing duplicate objects — if the same image is embedded multiple times (common when copying pages between documents), the optimizer keeps one copy and updates all references.
- Linearization — rearranges objects so the first page can be displayed while the rest of the file is still downloading. This does not reduce file size but dramatically improves perceived performance for web-hosted PDFs.
- Stripping metadata — removing XMP metadata, document information dictionaries, embedded thumbnails, and piece information can free up space. Be cautious, however, as some metadata may be legally or operationally required.
- Removing unused objects — incremental saves leave orphaned objects. A full rewrite with garbage collection eliminates them.
- Flattening form fields and annotations — if the document no longer needs to be interactive, burning annotations into the page content stream removes the overhead of interactive objects.
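Duplicate removal, the first item above, is typically implemented by fingerprinting stream contents and remapping repeat references to the first copy. A sketch (object numbers and helper names are illustrative):

```python
import hashlib

def dedupe_streams(streams: dict[int, bytes]) -> dict[int, int]:
    """Map each object number to the canonical object holding
    identical bytes; references to the others can then be rewritten."""
    canonical: dict[str, int] = {}   # content digest -> first object seen
    remap: dict[int, int] = {}
    for obj_num, data in sorted(streams.items()):
        digest = hashlib.sha256(data).hexdigest()
        remap[obj_num] = canonical.setdefault(digest, obj_num)
    return remap

streams = {4: b"<jpeg bytes>", 9: b"<jpeg bytes>", 12: b"<other image>"}
print(dedupe_streams(streams))  # {4: 4, 9: 4, 12: 12}
```

After remapping, object 9 becomes unreferenced and the garbage-collection pass described above deletes it.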
Practical Compression Strategies by Use Case
Not every document benefits from the same compression recipe. Here are recommended strategies for common scenarios:
Email Attachments
Many mail servers reject attachments over 10 MB or even 5 MB. For email, apply aggressive image downsampling to 96 DPI, use JPEG quality around 65, subset fonts, and strip unnecessary metadata. This combination typically reduces a 20 MB report to under 2 MB with acceptable visual quality.
Web Downloads
Speed matters for web-hosted PDFs. Linearize the file, downsample images to 120 DPI, and compress streams at maximum Flate level. Consider converting full-color diagrams to indexed color if they use fewer than 256 distinct colors. Use our online PDF compressor to apply these settings without installing software.
Archival (PDF/A)
Archival standards like PDF/A impose constraints: every font must be embedded (subset embedding is permitted), and certain compression filters are restricted, with PDF/A-1 forbidding LZW entirely. Use lossless compression only, and validate the output against the target PDF/A profile.
Print Production
For files destined for a commercial printer, do not downsample images below 300 DPI and avoid lossy JPEG recompression of images that have already been carefully color-managed. Lossless stream recompression and object deduplication are safe optimizations that can still yield meaningful savings.
Measuring Compression Results
After compressing a PDF, you should verify both the file size reduction and the visual quality. Open the compressed file and zoom into image-heavy areas at 200 % to check for JPEG artifacts. Compare text sharpness to the original. If you are compressing for print, produce a proof on the target device.
A good benchmark: for a typical business document with a mix of text, charts, and photographs, lossless optimization alone should achieve 15-40% reduction. Adding lossy image compression and downsampling can push that to 60-85% reduction, depending on the original image quality and resolution. Our compressor shows a visual bar chart comparing original and compressed sizes with an animated reveal, making the result immediately tangible.
How Our PDF Compressor Applies These Techniques
When you upload a file to the PDF compressor tool, the entire process runs in your browser with zero external requests. Before compression starts, the tool automatically detects encrypted PDFs (warning that protection will be removed) and PDF/A-compliant documents (warning that archival compliance may be broken) — you can choose to proceed or cancel in either case. Compression runs in a dedicated Web Worker, keeping the interface responsive even while large files are processed. The tool also warns you if a file exceeds 50 MB (processing may be slow) and blocks files over 200 MB, recommending a desktop application instead.
The tool runs a two-phase compression pipeline. First, a lossless optimization pass always executes regardless of the quality preset you choose:
- Metadata stripping — removes document info dictionaries (title, author, producer) and XMP metadata streams that add bulk without visible benefit.
- Stream deduplication — detects identical embedded streams (images, fonts) referenced in multiple places and collapses them to a single copy. During image recompression, an additional pass fingerprints each image's content, so identical images embedded under different objects are compressed only once and the result is shared across all references, saving both processing time and output size.
- Unused object removal — walks the entire object graph from the document root and deletes any orphaned objects left behind by incremental saves.
- Cross-reference rebuild — the file is saved with a clean, optimized cross-reference table, eliminating fragmentation from prior edits.
- Font stream optimization — compresses uncompressed embedded font programs with Flate (many PDF generators leave fonts uncompressed), and strips ToUnicode CMap streams that are only needed for text copy-paste, not for rendering. Documents with multiple fully-embedded fonts can save hundreds of kilobytes from this step alone.
Second, image recompression runs based on the quality preset you select. FlateDecode and DCTDecode images — including grayscale images and multi-filter chains (ASCII85+FlateDecode) — are re-encoded as JPEG at your chosen quality level. Unlike tools that use arbitrary scale factors, our compressor calculates the effective DPI of each image from the page dimensions and downsamples intelligently to a target: 300 DPI for high quality, 200 DPI for balanced, 150 DPI for web, and 96 DPI for maximum compression. Images already at or below the target DPI are left untouched — no upsampling ever occurs. This means a scanned document at 600 DPI can be halved in resolution with no visible difference at normal viewing, while a 150 DPI document stays crisp. The tool only replaces an image when the new version is actually smaller than the original.
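The preset logic described above can be summarized in a few lines. This is a hypothetical sketch mirroring the behavior the article describes, not the tool's actual source; the dictionary keys and function names are invented for illustration:

```python
# Target DPI per quality preset, as described above.
TARGET_DPI = {"high": 300, "balanced": 200, "web": 150, "maximum": 96}

def should_replace(original_size: int, recompressed_size: int) -> bool:
    """Only swap in the re-encoded image when it is actually smaller;
    otherwise the original stream is kept untouched."""
    return recompressed_size < original_size

print(TARGET_DPI["web"])                # 150
print(should_replace(80_000, 95_000))   # False: recompression grew the image
```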
For text-only PDFs, the lossless pass alone typically achieves 10-30% reduction. For image-heavy documents, combined lossless and lossy optimization routinely reaches 50-80% reduction.
Five presets are available, including a Lossless option that applies only structural optimizations (metadata, fonts, deduplication, unused objects) with zero image quality change — ideal for legal, medical, and archival documents. After selecting a file, the tool shows a document summary. You can cancel at any time, and fine-tune the image quality slider (10-100%) to dial in the exact balance between file size and visual fidelity. Presets like "Web Optimized" or "Balanced" set the slider to recommended values, but you can adjust it freely. After compression, the tool shows a detailed breakdown of exactly what was optimized: how many images were recompressed, whether metadata was stripped, how many fonts were optimized, duplicate streams merged, and unused objects removed. This transparency helps you understand why a file shrank (or didn't) and make informed decisions about quality settings.
Conclusion
PDF compression is not a single algorithm but a collection of techniques applied to different layers of the file structure. Stream compression with Flate handles text and vector data. JPEG recompression and downsampling tackle embedded images. Font subsetting trims typographic data. Object-level cleanup removes structural waste.
By understanding these mechanisms you can make informed choices about quality trade-offs, select the right settings for your use case, and confidently reduce PDF file size without degrading the reading experience. Whether you are emailing a contract, publishing a whitepaper, or archiving a decade of invoices, the right compression strategy ensures your PDFs are lean, fast, and professional.