> ## Documentation Index
> Fetch the complete documentation index at: https://docs.somark.cn/llms.txt
> Use this file to discover all available pages before exploring further.

# API Docs

> Complete SoPDF API reference covering all top-level functions, Document, Page, data types, and exceptions

<a id="top" />

## Quick navigation

* [Top-Level Functions](#top-functions)
* [Document Object Operations](#document-ops)
* [Page Object Operations](#page-ops)
* [Data Types](#data-types)
* [Exceptions](#exceptions)

<a id="top-functions" />

## Top-Level Functions

Start here for quick file-level tasks: opening, merging, and batch-rendering PDFs.

### Open a PDF document

Opens a PDF document and returns a `Document` instance.

**Signature:** `sopdf.open(path, password, *, stream)`

```python theme={null}
sopdf.open(
    path: str | pathlib.Path | None = None,
    password: str | None = None,
    *,
    stream: bytes | None = None,
) -> Document
```

<ParamField path="path" type="str | Path | None" default="None">
  File-system path to the PDF. Mutually exclusive with `stream`.
</ParamField>

<ParamField path="password" type="str | None" default="None">
  Password for encrypted PDFs. Pass `None` if no password is required.
</ParamField>

<ParamField path="stream" type="bytes | None" default="None">
  Open from raw bytes in memory instead of a file. Mutually exclusive with `path`.
</ParamField>

**Returns:** `Document` — The opened document object.

| Exception       | Condition                                                               |
| --------------- | ----------------------------------------------------------------------- |
| `PasswordError` | The document requires a password that was not provided or is incorrect. |
| `FileDataError` | The file is corrupted or cannot be parsed as a valid PDF.               |

```python theme={null}
# Open from a file path
doc = sopdf.open("report.pdf")

# Open an encrypted document
doc = sopdf.open("secure.pdf", password="hunter2")

# Open from raw bytes in memory
with open("report.pdf", "rb") as f:
    doc = sopdf.open(stream=f.read())

# Recommended: use a context manager for automatic resource cleanup
with sopdf.open("report.pdf") as doc:
    print(doc.page_count)
```

### Merge multiple PDF files

Merges multiple PDF files into a single output file, in the order provided.

**Signature:** `sopdf.merge(inputs, output)`

```python theme={null}
sopdf.merge(
    inputs: list[str | pathlib.Path],
    output: str | pathlib.Path,
) -> None
```

<ParamField path="inputs" type="list[str | Path]">
  Ordered list of PDF file paths to concatenate.
</ParamField>

<ParamField path="output" type="str | Path">
  Destination file path for the merged PDF.
</ParamField>

| Exception       | Condition                                   |
| --------------- | ------------------------------------------- |
| `ValueError`    | `inputs` list is empty.                     |
| `PasswordError` | One of the input files requires a password. |
| `FileDataError` | One of the input files cannot be read.      |

```python theme={null}
sopdf.merge(
    ["intro.pdf", "body.pdf", "appendix.pdf"],
    output="book.pdf",
)
```
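
Because `merge()` concatenates files in exactly the order given and rejects an empty list, it can help to build and check the input list first. The following is a minimal sketch, assuming the inputs sit in one directory and should be ordered by the numbers in their file names. `natural_key` and `collect_pdfs` are hypothetical helpers, not part of sopdf.

```python
import re
from pathlib import Path
from typing import Union


def natural_key(path: Path) -> list:
    """Split a file name into text and integer chunks so 'part2' sorts before 'part10'."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]


def collect_pdfs(directory: Union[str, Path]) -> list[Path]:
    """Return the *.pdf files in `directory` in natural order (hypothetical helper)."""
    files = sorted(Path(directory).glob("*.pdf"), key=natural_key)
    if not files:
        # Fail early with a clear message instead of passing an empty list on.
        raise ValueError(f"no PDF files found in {directory}")
    return files
```

The result could then be passed straight through, for example `sopdf.merge(collect_pdfs("chapters/"), output="book.pdf")`.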

### Render multiple pages to image bytes

Renders a list of pages to encoded image bytes.

**Signature:** `sopdf.render_pages(pages, *, dpi, format, alpha, parallel)`

```python theme={null}
sopdf.render_pages(
    pages: list[Page],
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> list[bytes]
```

<ParamField path="pages" type="list[Page]">
  List of page objects to render, typically from `doc.pages`.
</ParamField>

<ParamField path="dpi" type="int" default="72">
  Rendering resolution in dots per inch. Common values: 72 (screen preview), 150 (high quality), 300 (print quality).
</ParamField>

<ParamField path="format" type="str" default="&#x22;png&#x22;">
  Output image format: `"png"` or `"jpeg"`.
</ParamField>

<ParamField path="alpha" type="bool" default="False">
  Whether to include an alpha (transparency) channel. Only effective for PNG.
</ParamField>

<ParamField path="parallel" type="bool" default="False">
  Whether to render in multiple worker processes. Separate processes are not constrained by the GIL, so this can substantially speed up large batches on multi-core machines.
</ParamField>

**Recommended parameter presets**

| Scenario                  | Recommended parameters                               |
| ------------------------- | ---------------------------------------------------- |
| Screen preview            | `dpi=72, format="png", alpha=False, parallel=False`  |
| High-quality export       | `dpi=150, format="png", alpha=False, parallel=False` |
| Large-document throughput | `dpi=300, format="png", alpha=False, parallel=True`  |

**Returns:** `list[bytes]` — A list of encoded image bytes, one entry per page, in the same order as `pages`.

```python theme={null}
with sopdf.open("report.pdf") as doc:
    # Sequential rendering
    images = sopdf.render_pages(doc.pages, dpi=150)

    # Parallel rendering with multiprocessing (recommended for large documents)
    images = sopdf.render_pages(doc.pages, dpi=300, parallel=True)
```
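
Since the returned list keeps the order of `pages`, entry `i` of the result belongs to entry `i` of the input. This makes it easy to apply a custom naming scheme instead of the `page_0.png` pattern used by `render_pages_to_files()`. A minimal sketch, where `write_images` is a hypothetical helper and the 1-based, zero-padded names are just one possible choice:

```python
from pathlib import Path
from typing import Union


def write_images(images: list[bytes], output_dir: Union[str, Path], ext: str = "png") -> list[Path]:
    """Write encoded images to `output_dir`, numbering files from 001 in list order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    width = max(3, len(str(len(images))))  # zero-pad so names sort correctly
    paths = []
    for number, data in enumerate(images, start=1):
        path = out / f"page_{number:0{width}d}.{ext}"
        path.write_bytes(data)
        paths.append(path)
    return paths
```

With sopdf this might be called as `write_images(sopdf.render_pages(doc.pages, dpi=150), "out/")`.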

### Render multiple pages and write files

Renders pages and writes the results to a directory as `page_0.png`, `page_1.png`, etc.

**Signature:** `sopdf.render_pages_to_files(pages, output_dir, *, dpi, format, alpha, parallel)`

```python theme={null}
sopdf.render_pages_to_files(
    pages: list[Page],
    output_dir: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> None
```

<ParamField path="pages" type="list[Page]">
  List of page objects to render.
</ParamField>

<ParamField path="output_dir" type="str | Path">
  Output directory path. Created automatically if it does not exist.
</ParamField>

<ParamField path="dpi" type="int" default="72">
  Rendering resolution in dots per inch.
</ParamField>

<ParamField path="format" type="str" default="&#x22;png&#x22;">
  Output image format: `"png"` or `"jpeg"`.
</ParamField>

<ParamField path="alpha" type="bool" default="False">
  Whether to include an alpha channel (PNG only).
</ParamField>

<ParamField path="parallel" type="bool" default="False">
  Whether to use multiprocessing for rendering.
</ParamField>

**Recommended parameter presets**

| Scenario                | Recommended parameters                  |
| ----------------------- | --------------------------------------- |
| Preview thumbnails      | `dpi=72, format="png", parallel=False`  |
| Archive snapshots       | `dpi=150, format="png", parallel=False` |
| Multi-core batch output | `dpi=300, format="png", parallel=True`  |

```python theme={null}
with sopdf.open("report.pdf") as doc:
    sopdf.render_pages_to_files(doc.pages, "output/", dpi=150, parallel=True)
# Produces: output/page_0.png, output/page_1.png, ...
```

[Back to top](#top)

***

<a id="document-ops" />

## Document Object Operations

Use this section once you have a `Document` instance and need page management, splitting, merging, or saving.

`Document` represents an open PDF document. It should never be constructed directly — always obtain one via `sopdf.open()`.

### Properties

#### Total page count

**Member:** `doc.page_count` or `len(doc)`

```python theme={null}
doc.page_count -> int
```

The total number of pages in the document (read-only).

```python theme={null}
len(doc) -> int
```

`len(doc)` is equivalent to `doc.page_count`.

#### Metadata

```python theme={null}
doc.metadata -> Metadata
```

Document metadata — readable and writable via a [Metadata](#metadata) proxy object.

```python theme={null}
# Read
print(doc.metadata.title)
print(doc.metadata.creation_datetime)  # Python datetime

# Write (lazily initialises pikepdf, marks document dirty)
doc.metadata.title  = "Annual Report 2025"
doc.metadata.author = "Kevin Qiu"
doc.save("updated.pdf")
```

#### Document outline

```python theme={null}
doc.outline -> Outline
```

Document outline (table of contents) as an [Outline](#outline) tree. When the document has no bookmarks, the returned `Outline` is empty (`len(doc.outline) == 0`). Reading the outline uses pypdfium2 only, so read-only access does not incur the cost of initialising pikepdf.

```python theme={null}
for item in doc.outline.items:
    print(f"[p{item.page + 1}] {item.title}")

flat = doc.outline.to_list()  # PyMuPDF-compatible flat list
```

#### Encryption status

```python theme={null}
doc.is_encrypted -> bool
```

Whether the document is password-protected (read-only). Returns `True` even when the correct password has been provided and the document opened successfully.

#### Page sequence

```python theme={null}
doc.pages -> _PageList
```

Lazy sequence of all pages (read-only). Supports iteration and slicing. Commonly used with `render_pages()`.

### Page Access

#### Access page by index

**Signature:** `doc[index]` / `doc.load_page(index)`

```python theme={null}
doc[index: int] -> Page
doc.load_page(index: int) -> Page
```

Retrieves a page by 0-based index. Negative indices are supported (`doc[-1]` returns the last page).

| Exception   | Condition              |
| ----------- | ---------------------- |
| `PageError` | Index is out of range. |

```python theme={null}
first_page = doc[0]
last_page  = doc[-1]
third_page = doc.load_page(2)
```
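
The mapping from a negative index to a page position can be sketched as follows. This assumes negative indices follow ordinary Python sequence semantics; the source only confirms that `doc[-1]` is the last page. `normalize_index` is a hypothetical helper, and it raises `IndexError` only to stay dependency-free, whereas sopdf itself reports an out-of-range index as `PageError`.

```python
def normalize_index(index: int, page_count: int) -> int:
    """Map a possibly negative index to its 0-based position, Python-sequence style."""
    resolved = index + page_count if index < 0 else index
    if not 0 <= resolved < page_count:
        # sopdf reports this case as PageError; IndexError is used here
        # only to keep the sketch free of dependencies.
        raise IndexError(f"page index {index} out of range for {page_count} pages")
    return resolved


print(normalize_index(-1, 10))  # 9
print(normalize_index(2, 10))   # 2
```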

#### Iteration

```python theme={null}
for page in doc:
    print(page.number)
```

### Split

#### Split document by pages

**Signature:** `doc.split(pages, output)`

```python theme={null}
doc.split(
    pages: list[int],
    output: str | pathlib.Path | None = None,
) -> Document
```

Extracts specified pages from the current document and returns a new `Document` object.

<ParamField path="pages" type="list[int]">
  List of 0-based page indices to extract. The output order matches the list order.
</ParamField>

<ParamField path="output" type="str | Path | None" default="None">
  If provided, the new document is also written to this path. Otherwise, it is returned in memory only.
</ParamField>

**Returns:** `Document` — A new document containing the specified pages.

```python theme={null}
# Extract the first 3 pages and save to disk
chapter = doc.split(pages=[0, 1, 2], output="chapter1.pdf")

# Extract to memory only, no disk write
excerpt = doc.split(pages=[4, 5, 6])
```
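
User-facing page numbers are usually 1-based and given as ranges, while `split()` expects a list of 0-based indices. A minimal sketch of the conversion, where `parse_page_ranges` and its `"1-3,7"` syntax are assumptions made for illustration rather than anything sopdf provides:

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Convert a 1-based spec such as "1-3,7" into 0-based indices, keeping the given order."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start  # "7" means the single page 7
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges("1-3, 7", page_count=10))  # [0, 1, 2, 6]
```

The result could then be used as `doc.split(pages=parse_page_ranges("1-3,7", doc.page_count))`.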

#### Split into single-page files

**Signature:** `doc.split_each(output_dir)`

```python theme={null}
doc.split_each(output_dir: str | pathlib.Path) -> None
```

Saves each page as a separate PDF file. Files are named `page_0.pdf`, `page_1.pdf`, etc.

<ParamField path="output_dir" type="str | Path">
  Output directory path. Created automatically if it does not exist.
</ParamField>

```python theme={null}
doc.split_each("pages/")
# Produces: pages/page_0.pdf, pages/page_1.pdf, ...
```

### Merge

#### Append pages from another document

**Signature:** `doc.append(other)`

```python theme={null}
doc.append(other: Document) -> None
```

Appends all pages of another document to the end of this document. After calling this method, the document is marked as modified and must be saved via `save()` or `to_bytes()` to persist the change.

<ParamField path="other" type="Document">
  The document whose pages will be appended.
</ParamField>

```python theme={null}
with sopdf.open("part1.pdf") as doc_a, sopdf.open("part2.pdf") as doc_b:
    doc_a.append(doc_b)
    doc_a.save("combined.pdf")
```

### Save

#### Save to file

**Signature:** `doc.save(path, *, compress, garbage, linearize)`

```python theme={null}
doc.save(
    path: str | pathlib.Path,
    *,
    compress: bool = True,
    garbage: bool = False,
    linearize: bool = False,
) -> None
```

Writes the document to disk.

<ParamField path="path" type="str | Path">
  Destination file path.
</ParamField>

<ParamField path="compress" type="bool" default="True">
  Whether to compress content streams. Can significantly reduce file size.
</ParamField>

<ParamField path="garbage" type="bool" default="False">
  Whether to generate object streams for additional structural compression.
</ParamField>

<ParamField path="linearize" type="bool" default="False">
  Whether to linearize the PDF for optimized sequential network access (Fast Web View).
</ParamField>

```python theme={null}
# Basic save (compression enabled by default)
doc.save("output.pdf")

# Maximum compression
doc.save("output.pdf", compress=True, garbage=True)

# Strip encryption (open with the correct password, then save)
doc.save("unlocked.pdf")
```

#### Export as bytes

**Signature:** `doc.to_bytes(*, compress)`

```python theme={null}
doc.to_bytes(compress: bool = True) -> bytes
```

Serializes the document to bytes without writing to disk. Useful for in-memory processing or serving a PDF over a network.

<ParamField path="compress" type="bool" default="True">
  Whether to compress content streams.
</ParamField>

**Returns:** `bytes` — The complete PDF file contents as bytes.

```python theme={null}
pdf_bytes = doc.to_bytes()

# Return directly as a Flask HTTP response (from inside a view function)
from flask import Flask, Response

app = Flask(__name__)

@app.route("/report.pdf")
def report():
    return Response(doc.to_bytes(), mimetype="application/pdf")
```

### Lifecycle

#### Close document

**Signature:** `doc.close()`

```python theme={null}
doc.close() -> None
```

Closes the document and releases all file handles and memory resources. Using a `with` statement is recommended over calling this directly.

#### Context Manager

```python theme={null}
with sopdf.open("file.pdf") as doc:
    ...
# close() is called automatically on exit
```

[Back to top](#top)

***

<a id="page-ops" />

## Page Object Operations

Use this section for single-page workflows such as rendering, text extraction, and text search.

`Page` represents a single page within a document. Obtained via `doc[i]` or `doc.load_page(i)` — never constructed directly.

### Properties

#### Page index

**Member:** `page.number`

```python theme={null}
page.number -> int
```

The 0-based index of this page (read-only).

#### Page dimensions

**Member:** `page.rect`

```python theme={null}
page.rect -> Rect
```

The page dimensions as a `Rect` in PDF points (1 pt = 1/72 inch) (read-only). Use `rect.width` and `rect.height` to get the page size.
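
Because one point is 1/72 inch, page dimensions convert directly to physical units and to approximate pixel sizes at a given DPI. A small worked sketch using a US Letter page (612 x 792 pt); the helper names are hypothetical, and the exact rounding the renderer applies is an assumption:

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def points_to_pixels(points: float, dpi: int) -> int:
    """Approximate pixel length at `dpi`; the renderer's exact rounding is an assumption."""
    return round(points * dpi / POINTS_PER_INCH)


width_pt, height_pt = 612.0, 792.0  # US Letter
print(points_to_pixels(width_pt, 150), points_to_pixels(height_pt, 150))  # 1275 1650
print(round(points_to_mm(width_pt), 1), round(points_to_mm(height_pt), 1))  # 215.9 279.4
```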

#### Page rotation

**Member:** `page.rotation`

```python theme={null}
page.rotation -> int          # read current rotation
page.rotation = degrees: int  # set rotation
```

The page rotation in degrees. Must be one of `0`, `90`, `180`, `270` (read/write).

| Exception   | Condition                                     |
| ----------- | --------------------------------------------- |
| `PageError` | Set to a value other than 0, 90, 180, or 270. |

### Rendering

#### Render to image bytes

**Signature:** `page.render(*, dpi, format, alpha)`

```python theme={null}
page.render(
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> bytes
```

Renders the page to encoded image bytes.

<ParamField path="dpi" type="int" default="72">
  Rendering resolution in dots per inch. Use 72 for screen preview, 300 for print quality.
</ParamField>

<ParamField path="format" type="str" default="&#x22;png&#x22;">
  Output image format: `"png"` or `"jpeg"`.
</ParamField>

<ParamField path="alpha" type="bool" default="False">
  Whether to include an alpha (transparency) channel. Only effective for PNG; JPEG does not support transparency.
</ParamField>

**Recommended parameter presets**

| Scenario          | Recommended parameters               |
| ----------------- | ------------------------------------ |
| On-screen preview | `dpi=72, format="png", alpha=False`  |
| Crisp snapshot    | `dpi=150, format="png", alpha=False` |
| Print-grade image | `dpi=300, format="png", alpha=False` |

**Returns:** `bytes` — Encoded image bytes (PNG or JPEG).

```python theme={null}
png_bytes  = page.render(dpi=150)
jpeg_bytes = page.render(dpi=150, format="jpeg")
png_alpha  = page.render(dpi=72, alpha=True)
```

#### Render and save image

**Signature:** `page.render_to_file(path, *, dpi, format, alpha)`

```python theme={null}
page.render_to_file(
    path: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> None
```

Renders the page and writes the image to a file. The `dpi`, `format`, and `alpha` parameters behave as in `render()`.

<ParamField path="path" type="str | Path">
  Output file path (including extension).
</ParamField>

<ParamField path="dpi" type="int" default="72">
  Rendering resolution in dots per inch.
</ParamField>

<ParamField path="format" type="str" default="&#x22;png&#x22;">
  Output image format: `"png"` or `"jpeg"`.
</ParamField>

<ParamField path="alpha" type="bool" default="False">
  Whether to include an alpha channel (PNG only).
</ParamField>

```python theme={null}
page.render_to_file("page0.png", dpi=300)
page.render_to_file("page0.jpg", dpi=150, format="jpeg")
```

### Text Extraction

#### Extract plain text

**Signature:** `page.get_text(*, rect)`

```python theme={null}
page.get_text(
    *,
    rect: Rect | None = None,
) -> str
```

Extracts plain text from the page.

<ParamField path="rect" type="Rect | None" default="None">
  Restrict extraction to this rectangular region. Extracts the full page when `None`.
</ParamField>

**Returns:** `str` — The extracted plain text.

```python theme={null}
full_text = page.get_text()

# Extract from a specific region only
region = Rect(0, 0, 300, 100)
header_text = page.get_text(rect=region)
```

#### Extract text blocks

**Signature:** `page.get_text_blocks(*, rect, format)`

```python theme={null}
page.get_text_blocks(
    *,
    rect: Rect | None = None,
    format: str = "list",
) -> list
```

Extracts structured text blocks with bounding boxes.

<ParamField path="rect" type="Rect | None" default="None">
  Restrict extraction to this rectangular region. Extracts the full page when `None`.
</ParamField>

<ParamField path="format" type="str" default="&#x22;list&#x22;">
  Return format. `"list"` returns a list of `TextBlock` objects; `"dict"` returns a list of plain dictionaries with `"text"` and `"rect"` keys.
</ParamField>

**Returns:** `format="list"` → `list[TextBlock]`; `format="dict"` → `list[dict]`, each of the form `{"text": "...", "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}`

```python theme={null}
blocks = page.get_text_blocks()
for block in blocks:
    print(block.text, block.rect)

# Return as dictionaries (convenient for JSON serialization)
dicts = page.get_text_blocks(format="dict")
```
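
The dictionary form is plain data, so it can be sorted and serialized with the standard library alone. The sketch below uses made-up sample blocks in the documented `{"text", "rect"}` shape and sorts them top to bottom, then left to right, relying on the top-left origin described under `Rect`. Whether `get_text_blocks()` already returns blocks in reading order is not stated here, and this simple sort is only a heuristic that will not handle multi-column layouts. `reading_order` is a hypothetical helper.

```python
import json

# Sample data in the documented format="dict" shape (values are made up).
blocks = [
    {"text": "Footer", "rect": {"x0": 72.0, "y0": 740.0, "x1": 300.0, "y1": 755.0}},
    {"text": "Title", "rect": {"x0": 72.0, "y0": 60.0, "x1": 400.0, "y1": 90.0}},
    {"text": "Body", "rect": {"x0": 72.0, "y0": 120.0, "x1": 540.0, "y1": 700.0}},
]


def reading_order(items: list[dict]) -> list[dict]:
    """Sort top-to-bottom, then left-to-right, using each block's top-left corner."""
    return sorted(items, key=lambda b: (b["rect"]["y0"], b["rect"]["x0"]))


ordered = reading_order(blocks)
print([b["text"] for b in ordered])  # ['Title', 'Body', 'Footer']

# The sorted blocks are JSON-serializable as they are.
payload = json.dumps(ordered, ensure_ascii=False)
```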

### Text Search

#### Search text positions

**Signature:** `page.search(query, *, match_case)`

```python theme={null}
page.search(
    query: str,
    *,
    match_case: bool = False,
) -> list[Rect]
```

Searches the page for a text string and returns the bounding rectangles of all matches.

<ParamField path="query" type="str">
  The text string to search for.
</ParamField>

<ParamField path="match_case" type="bool" default="False">
  Whether the search is case-sensitive. Case-insensitive by default.
</ParamField>

**Returns:** `list[Rect]` — Bounding rectangles for each match. Returns an empty list if no matches are found.

```python theme={null}
hits = page.search("invoice")
for rect in hits:
    print(f"Match at {rect}")

# Case-sensitive search
hits = page.search("PDF", match_case=True)
```
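
To highlight or crop a match on a rendered image, the rectangle has to be converted from points to pixels. A minimal sketch, assuming the rendered image maps the page's top-left corner to pixel (0, 0), scales uniformly by `dpi / 72`, and applies no rotation; none of this is confirmed by the reference above. `rect_to_pixel_box` is a hypothetical helper that accepts anything unpacking to four numbers, as `Rect` does.

```python
import math


def rect_to_pixel_box(rect, dpi: int) -> tuple[int, int, int, int]:
    """Scale an (x0, y0, x1, y1) rectangle in points to a pixel box at `dpi`.

    Rounds outward (floor the top-left, ceil the bottom-right) so the box
    fully covers the match.
    """
    x0, y0, x1, y1 = rect
    scale = dpi / 72.0
    return (math.floor(x0 * scale), math.floor(y0 * scale),
            math.ceil(x1 * scale), math.ceil(y1 * scale))


print(rect_to_pixel_box((72.0, 100.0, 144.5, 112.0), dpi=144))  # (144, 200, 289, 224)
```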

#### Search text with context blocks

**Signature:** `page.search_text_blocks(query, *, match_case)`

```python theme={null}
page.search_text_blocks(
    query: str,
    *,
    match_case: bool = False,
) -> list[dict]
```

Searches for text and returns each match along with the surrounding text block for context.

<ParamField path="query" type="str">
  The text string to search for.
</ParamField>

<ParamField path="match_case" type="bool" default="False">
  Whether the search is case-sensitive.
</ParamField>

**Returns:** `list[dict]`, each element contains:

| Key            | Type   | Description                                               |
| -------------- | ------ | --------------------------------------------------------- |
| `"text"`       | `str`  | Full text content of the block containing the match.      |
| `"rect"`       | `Rect` | Bounding rectangle of the containing text block.          |
| `"match_rect"` | `Rect` | Precise bounding rectangle of the matched keyword itself. |

```python theme={null}
results = page.search_text_blocks("total amount")
for r in results:
    print(r["text"])        # full paragraph containing the keyword
    print(r["match_rect"])  # exact position of the keyword
```

[Back to top](#top)

***

<a id="data-types" />

## Data Types

Refer to this section when you need to understand response structures (for example `Rect`, `TextBlock`, and `Metadata`) for downstream processing.

### Rect

Represents a rectangular region. Coordinates are in PDF points (pt, where 1 pt = 1/72 inch). The coordinate system has its origin at the top-left corner of the page, with x increasing rightward and y increasing downward.

```python theme={null}
Rect(x0: float, y0: float, x1: float, y1: float)
```

**Constructor Parameters**

| Parameter | Type    | Description                                            |
| --------- | ------- | ------------------------------------------------------ |
| `x0`      | `float` | Left edge (x-coordinate of the top-left corner).       |
| `y0`      | `float` | Top edge (y-coordinate of the top-left corner).        |
| `x1`      | `float` | Right edge (x-coordinate of the bottom-right corner).  |
| `y1`      | `float` | Bottom edge (y-coordinate of the bottom-right corner). |

**Core properties (common)**

| Property | Type    | Description                           |
| -------- | ------- | ------------------------------------- |
| `x0`     | `float` | Left edge.                            |
| `y0`     | `float` | Top edge.                             |
| `x1`     | `float` | Right edge.                           |
| `y1`     | `float` | Bottom edge.                          |
| `width`  | `float` | Rectangle width, equal to `x1 - x0`.  |
| `height` | `float` | Rectangle height, equal to `y1 - y0`. |

<details>
  <summary>Advanced properties and methods (expand)</summary>

  **Advanced properties**

  | Property   | Type   | Description                              |
  | ---------- | ------ | ---------------------------------------- |
  | `is_valid` | `bool` | `True` when `x0 ≤ x1` and `y0 ≤ y1`.     |
  | `is_empty` | `bool` | `True` when the rectangle has zero area. |

  **Methods**

  | Method                | Returns | Description                                                                                                                              |
  | --------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
  | `get_area()`          | `float` | Rectangle area. Returns `0` for invalid rectangles.                                                                                      |
  | `contains(other)`     | `bool`  | If `other` is a `Rect`, returns `True` if it is fully contained. If `other` is an `(x, y)` tuple, returns `True` if the point is inside. |
  | `intersects(other)`   | `bool`  | Returns `True` if the two rectangles overlap (touching edges count).                                                                     |
  | `intersect(other)`    | `Rect`  | Returns the intersection region. Returns an empty `Rect` if there is no overlap.                                                         |
  | `include_rect(other)` | `Rect`  | Returns the smallest bounding rectangle that contains both rectangles.                                                                   |
  | `include_point(x, y)` | `Rect`  | Returns a new rectangle expanded to include the given point.                                                                             |
</details>

All geometric operations return new `Rect` instances — the original is immutable.

```python theme={null}
r = Rect(10, 20, 200, 300)
print(r.width)    # 190.0
print(r.height)   # 280.0

# Containment check
print(r.contains(Rect(50, 50, 100, 100)))  # True
print(r.contains((50, 50)))                # True (point)

# Intersection
a = Rect(0, 0, 100, 100)
b = Rect(50, 50, 150, 150)
print(a.intersect(b))  # Rect(50, 50, 100, 100)

# Unpack
x0, y0, x1, y1 = r
```
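
The semantics listed in the methods table can be illustrated with a small stand-in class. This is a sketch for reasoning about the geometry, not sopdf's implementation: `Box` is a hypothetical name, and representing an empty result as an all-zero rectangle is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Illustrative stand-in for Rect; not sopdf's implementation."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Touching edges count as overlapping, as the methods table describes.
        return (self.x0 <= other.x1 and other.x0 <= self.x1
                and self.y0 <= other.y1 and other.y0 <= self.y1)

    def intersect(self, other: "Box") -> "Box":
        if not self.intersects(other):
            return Box(0.0, 0.0, 0.0, 0.0)  # assumed representation of "empty"
        return Box(max(self.x0, other.x0), max(self.y0, other.y0),
                   min(self.x1, other.x1), min(self.y1, other.y1))

    def include_rect(self, other: "Box") -> "Box":
        # Smallest box that contains both inputs.
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))
```

Every method returns a new object and the dataclass is frozen, which mirrors the immutability noted above.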

### TextBlock

Represents a single block of text on a page, together with its bounding box.

```python theme={null}
TextBlock(text: str, rect: Rect)
```

| Attribute / Method | Type   | Description                                                                        |
| ------------------ | ------ | ---------------------------------------------------------------------------------- |
| `text`             | `str`  | The text content of the block.                                                     |
| `rect`             | `Rect` | Bounding rectangle of the block on the page.                                       |
| `to_dict()`        | `dict` | Converts to `{"text": ..., "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}`. |

```python theme={null}
blocks = page.get_text_blocks()
for block in blocks:
    print(block.text)
    print(block.rect.width, block.rect.height)
    print(block.to_dict())
```

### Metadata

Read/write proxy for the PDF Document Info dictionary. Obtained via `doc.metadata` — never constructed directly.

**Read path** (zero pikepdf cost): each property calls `pypdfium2.get_metadata_dict()` after auto-syncing.

**Write path** (lazy pikepdf init): each setter calls `_ensure_pike()`, writes to `pike_doc.docinfo`, and marks the document dirty. The next read auto-syncs.

**Core fields (common)**

| Property   | Type          | Description                                |
| ---------- | ------------- | ------------------------------------------ |
| `title`    | `str \| None` | Document title (`/Title`). Read/write.     |
| `author`   | `str \| None` | Author name (`/Author`). Read/write.       |
| `subject`  | `str \| None` | Document subject (`/Subject`). Read/write. |
| `keywords` | `str \| None` | Search keywords (`/Keywords`). Read/write. |

<details>
  <summary>Advanced fields and methods (expand)</summary>

  **Advanced fields**

  | Property            | Type               | Description                                                                                         |
  | ------------------- | ------------------ | --------------------------------------------------------------------------------------------------- |
  | `creator`           | `str \| None`      | Authoring tool that created the source document (`/Creator`). Read/write.                           |
  | `producer`          | `str \| None`      | Tool that produced the PDF (`/Producer`). Read/write.                                               |
  | `creation_date`     | `str \| None`      | Raw PDF creation date string (`/CreationDate`). Read/write.                                         |
  | `mod_date`          | `str \| None`      | Raw PDF modification date string (`/ModDate`). Read/write.                                          |
  | `creation_datetime` | `datetime \| None` | `creation_date` parsed as a Python `datetime`. Read-only. Returns `None` if missing or unparseable. |
  | `mod_datetime`      | `datetime \| None` | `mod_date` parsed as a Python `datetime`. Read-only. Returns `None` if missing or unparseable.      |

  **Methods**

  | Method             | Returns                  | Description                                                                           |
  | ------------------ | ------------------------ | ------------------------------------------------------------------------------------- |
  | `to_dict()`        | `dict[str, str \| None]` | All fields as a dict with lowercase keys. Matches the old `doc.metadata` dict format. |
  | `__getitem__(key)` | `str \| None`            | `meta["title"]` — backward-compatible dict-style access.                              |
</details>

**PDF date string format:** `D:YYYYMMDDHHmmSSOHH'mm'` (prefix `D:` and timezone optional).
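
For reference, a date string in this format can be parsed with the standard library alone. The following is a minimal sketch and not the parser sopdf uses; `parse_pdf_date` is a hypothetical helper. It additionally accepts truncated strings (for example date only), which the PDF specification permits, and returns `None` for anything it cannot parse.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Parse a PDF date string; return None when it does not match the format."""
    m = _PDF_DATE.match(raw.strip())
    if m is None:
        return None
    # Missing month/day default to 1, missing time fields default to 0.
    year, month, day, hour, minute, second = (
        int(v) if v is not None else default
        for v, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    try:
        tz = None
        if m.group(7):  # "Z" means UTC
            tz = timezone.utc
        elif m.group(8):  # explicit offset such as +08'00'
            offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
            tz = timezone(-offset if m.group(8) == "-" else offset)
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:  # out-of-range field, e.g. month 13
        return None
```

Strings without a timezone yield a naive `datetime`, so callers that compare dates across documents may want to normalise them first.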

```python theme={null}
with sopdf.open("report.pdf") as doc:
    meta = doc.metadata

    # Read individual fields
    print(meta.title)
    print(meta.creation_datetime)   # datetime(2024, 1, 1, 12, 0, tzinfo=...)

    # Write
    meta.title  = "New Title"
    meta.author = "Kevin Qiu"
    doc.save("updated.pdf")

    # Dict-style read (backward compat)
    d = meta.to_dict()
    print(d["title"])
    print(meta["author"])
```

### OutlineItem

An immutable bookmark node in the document outline.

```python theme={null}
@dataclass(frozen=True)
class OutlineItem:
    title:    str
    page:     int                          # 0-based; -1 = no destination page
    level:    int                          # 0 = top-level
    children: tuple[OutlineItem, ...] = ()
```

| Attribute / Method | Type                      | Description                                                       |
| ------------------ | ------------------------- | ----------------------------------------------------------------- |
| `title`            | `str`                     | Bookmark label as displayed in the reader TOC panel.              |
| `page`             | `int`                     | 0-based target page index; `-1` when the item has no destination. |
| `level`            | `int`                     | Nesting depth; `0` = top-level item.                              |
| `children`         | `tuple[OutlineItem, ...]` | Nested child items (frozen tuple).                                |
| `to_dict()`        | `dict`                    | Serialize to a plain dict (recursive).                            |

### Outline

Read-only bookmark tree manager. Obtained via `doc.outline` — never constructed directly. The tree is built once on first access using pypdfium2's TOC data — no pikepdf initialisation needed.

| Member          | Returns             | Description                                                                                                              |
| --------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `items`         | `list[OutlineItem]` | Top-level outline items (each may have nested `children`).                                                               |
| `to_list()`     | `list[dict]`        | Flat DFS traversal. Each entry: `{"level": int, "title": str, "page": int}`. Compatible with PyMuPDF `get_toc()` output. |
| `len(outline)`  | `int`               | Total number of nodes across all nesting levels.                                                                         |
| `iter(outline)` | —                   | Iterate over top-level items.                                                                                            |
| `bool(outline)` | `bool`              | `True` when the document has at least one outline item.                                                                  |

```python theme={null}
with sopdf.open("textbook.pdf") as doc:
    outline = doc.outline
    print(outline)          # Outline(top_level=2, total=4)
    print(bool(outline))    # True

    # Recursive tree traversal
    def print_tree(items, indent=0):
        for item in items:
            print("  " * indent + f"[p{item.page + 1}] {item.title}")
            print_tree(item.children, indent + 1)

    print_tree(outline.items)

    # Flat list (PyMuPDF-compatible)
    for row in outline.to_list():
        print(f"{'  ' * row['level']}{row['title']}  →  p{row['page'] + 1}")
```
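
The relationship between the nested tree and the flat list can be sketched with a stand-in node type. `Node` and `flatten` are hypothetical names that mirror the documented `OutlineItem` fields and the `to_list()` row shape; the pre-order traversal (parent before children) is an assumption about what "flat DFS traversal" means here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Node:
    """Stand-in with the same fields as OutlineItem, for illustration only."""
    title: str
    page: int
    level: int
    children: tuple = ()


def flatten(items: Iterable[Node]) -> list[dict]:
    """Depth-first, pre-order walk: each parent is emitted before its children."""
    rows = []
    for item in items:
        rows.append({"level": item.level, "title": item.title, "page": item.page})
        rows.extend(flatten(item.children))
    return rows


tree = [
    Node("Chapter 1", 0, 0, (Node("1.1 Basics", 2, 1), Node("1.2 Setup", 5, 1))),
    Node("Chapter 2", 9, 0),
]
for row in flatten(tree):
    print(row)
```

This sample tree has two top-level items and four nodes in total, the same shape as the `Outline(top_level=2, total=4)` output shown above.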

[Back to top](#top)

***

<a id="exceptions" />

## Exceptions

Read this section first when integrating PDF processing into production services and designing robust error handling.

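All sopdf-specific exceptions inherit from `PDFError`, which inherits from the built-in `RuntimeError`. The `ValueError` that `sopdf.merge()` raises for an empty input list is the built-in exception and is not part of this hierarchy.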

```
RuntimeError
└── PDFError
    ├── PasswordError
    ├── FileDataError
    └── PageError
```

| Exception       | When Raised                                                                            | Handling Guidance                                                   | Recoverable        |
| --------------- | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ------------------ |
| `PDFError`      | Base class for all sopdf exceptions. Catch this to handle any sopdf error.             | Use as a top-level fallback for logging and unified user messaging. | Depends on subtype |
| `PasswordError` | Opening an encrypted PDF with a missing or incorrect password.                         | Prompt again for password and limit retries.                        | Yes                |
| `FileDataError` | PDF file is corrupted, has an invalid format, or cannot be parsed.                     | Ask user to re-upload or replace the source file.                   | No                 |
| `PageError`     | Page index is out of range, or rotation is set to an invalid value (not 0/90/180/270). | Validate index/range and rotation before calling.                   | Yes                |

**Recommended catch order:** catch specific exceptions first (`PasswordError` / `FileDataError` / `PageError`), then `PDFError` as a final fallback.

```python theme={null}
import sopdf

try:
    doc = sopdf.open("file.pdf", password="wrong")
except sopdf.PasswordError:
    print("Incorrect password")
except sopdf.FileDataError:
    print("File is corrupted")
except sopdf.PDFError as e:
    print(f"PDF error: {e}")
```
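
The "prompt again and limit retries" guidance for `PasswordError` can be sketched as a bounded retry loop. To keep the example runnable without sopdf installed, it defines stand-in exception classes that mirror the hierarchy above and takes the opener as a parameter. In real code the stand-ins would be replaced by `sopdf.PDFError` and `sopdf.PasswordError`, and `sopdf.open` would be passed as `opener`. `open_with_passwords` is a hypothetical helper.

```python
from typing import Callable, Iterable


# Stand-ins mirroring the sopdf hierarchy so the sketch runs on its own.
class PDFError(RuntimeError): ...
class PasswordError(PDFError): ...


def open_with_passwords(opener: Callable, path: str, candidates: Iterable,
                        max_attempts: int = 3):
    """Try each candidate password in turn, giving up after `max_attempts`."""
    last_error = None
    for attempt, password in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break
        try:
            return opener(path, password=password)
        except PasswordError as exc:
            last_error = exc  # wrong password: try the next candidate
    raise PasswordError(f"no valid password for {path}") from last_error
```

Only `PasswordError` is retried. Any other error, such as `FileDataError`, propagates immediately, which matches its "not recoverable" classification in the table.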

[Back to top](#top)
