# Overview

> Capabilities, architecture, and quick start for the high-performance open-source PDF library

## What is SoPDF

**SoPDF** is an open-source Python PDF processing library that covers the full workflow from rendering and text extraction to structural editing. You can use it to build production-grade parsing, retrieval, split/merge, and batch processing pipelines.

<Note>
  SoPDF is released under Apache 2.0. You can use it in personal projects, commercial products, and open-source libraries.
</Note>

## Why SoPDF

* High performance: official benchmarks report speedups over PyMuPDF in core scenarios, up to about `2.82x` in rendering, `2.74x` in plain-text extraction, and `3.17x` in full-text search.
* Feature complete: supports rendering, extraction, search, split/merge, compressed save, metadata, and outline handling.
* Clean API: designed for practical day-to-day engineering workflows.
* Permissive license: Apache 2.0 simplifies internal adoption and external distribution.

## Core capabilities

| Capability            | Description                                                                      |
| --------------------- | -------------------------------------------------------------------------------- |
| Open documents        | Open PDFs from file paths, bytes, or streams                                     |
| Page rendering        | Render to PNG/JPEG, including batch and parallel rendering                       |
| Text workflows        | Extract plain text, extract text blocks with bounding boxes, and search keywords |
| Document editing      | Split, merge, rotate pages, and save with compression                            |
| Operational readiness | Handle encrypted PDFs and auto-repair corrupted PDFs                             |
| Document metadata     | Read/write metadata and read document outline (TOC)                              |

## Architecture

SoPDF uses a dual-engine architecture:

* `pypdfium2` (Google PDFium): rendering, text extraction, and search.
* `pikepdf` (libqpdf): structure reads, writes, and save/compression.

A dirty-flag and hot-reload sync mechanism reflects write operations in the read path automatically, so you do not need to coordinate the two engines manually.
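
The general idea behind a dirty-flag sync can be sketched as follows. This is a minimal illustration of the pattern, not SoPDF's actual implementation: `ReadEngine`, `WriteEngine`, `Document`, and the `reloads` counter are hypothetical stand-ins, and plain bytes take the place of real PDF data.

```python
class ReadEngine:
    """Hypothetical stand-in for the read side (rendering, text, search)."""

    def __init__(self, data: bytes) -> None:
        self.data = data  # snapshot taken when the engine is (re)loaded

    def get_text(self) -> str:
        return self.data.decode("utf-8")


class WriteEngine:
    """Hypothetical stand-in for the write side (structural edits, save)."""

    def __init__(self, data: bytes) -> None:
        self.buffer = bytearray(data)

    def append(self, more: bytes) -> None:
        self.buffer += more

    def serialize(self) -> bytes:
        return bytes(self.buffer)


class Document:
    """Facade that keeps the two engines in sync with a dirty flag."""

    def __init__(self, data: bytes) -> None:
        self._writer = WriteEngine(data)
        self._reader = ReadEngine(data)
        self._dirty = False
        self.reloads = 0  # counts how often the read side was rebuilt

    def append(self, more: bytes) -> None:
        self._writer.append(more)
        self._dirty = True  # mark the read snapshot as stale, nothing more

    def _fresh_reader(self) -> ReadEngine:
        # Rebuild the read engine lazily, only when a write has happened.
        if self._dirty:
            self._reader = ReadEngine(self._writer.serialize())
            self._dirty = False
            self.reloads += 1
        return self._reader

    def get_text(self) -> str:
        return self._fresh_reader().get_text()


doc = Document(b"hello")
doc.append(b" world")
doc.append(b"!")
print(doc.get_text())  # hello world!
```

In this sketch, two consecutive writes trigger only one reload, because the read side is rebuilt on the next read rather than on every write.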

## Quick start

```bash
pip install sopdf
```

Requirements: Python `3.10+`.

```python
import sopdf

with sopdf.open("document.pdf") as doc:
    # Render the first page at 150 DPI
    img_bytes = doc[0].render(dpi=150)

    # Extract plain text and text blocks from the first page
    text = doc[0].get_text()
    blocks = doc[0].get_text_blocks()

    # Search the first page, ignoring case
    hits = doc[0].search("invoice", match_case=False)

    # Split the first three pages into a new file
    part = doc.split(pages=[0, 1, 2], output="chapter1.pdf")

    # Merge separate files into one
    sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

    # Append the split part to this document, then save with compression
    doc.append(part)
    doc.save("out.pdf", compress=True, garbage=True)
```
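
For batch pipelines, the same calls can be wrapped in a small helper. The sketch below is an assumption-laden illustration, not part of SoPDF: `first_page_text` and `extract_many` are hypothetical names, the only SoPDF calls used are `sopdf.open` and `doc[0].get_text()` from the quick start, and running them from a thread pool assumes that is safe for your workload, which you should verify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def first_page_text(path: str) -> str:
    """Return the text of page 0, using only calls shown in the quick start."""
    import sopdf  # imported lazily so extract_many works with any extractor

    with sopdf.open(path) as doc:
        return doc[0].get_text()


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], str] = first_page_text,
    max_workers: int = 4,
) -> dict[str, str]:
    """Apply `extract` to every path and return a {path: text} mapping."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        texts = list(pool.map(extract, path_list))
    return dict(zip(path_list, texts))
```

Passing the extractor as a parameter keeps the helper easy to test with a stub and lets you swap in a different per-file step, such as search, without changing the batching logic.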

## Resources

* Repository: [SoMarkAI/SoPDF](https://github.com/SoMarkAI/SoPDF)
* PyPI: [sopdf](https://pypi.org/project/sopdf/)
* Examples: `examples/` in the repository
* Benchmark suite: `tests/benchmark/` in the repository

## License

[Apache License 2.0](https://github.com/SoMarkAI/SoPDF/blob/main/LICENSE)
