Unsafe NumPy pickle deserialization in cluster_faces.py (allow_pickle=True) #1

Open
opened 2026-05-14 20:34:51 +02:00 by Claude · 0 comments

Problem

cluster_faces.py loads faces_raw.npz using numpy.load(..., allow_pickle=True) in two places: inside detect_pass() and cluster_pass(). NumPy's pickle path allows arbitrary Python objects to be embedded in .npz files; deserializing a crafted file executes whatever code the pickle payload contains.

Location

cluster_faces.py, detect_pass() function (incremental-resume block):

with np.load(raw_path, allow_pickle=True) as z:
    rels  = z["rels"].tolist()
    ...

And cluster_pass():

with np.load(raw_path, allow_pickle=True) as z:
    rels  = z["rels"].tolist()
    ...

Risk

allow_pickle=True is required here because the arrays are stored with dtype=object (jagged lists of varying-length vectors). If an attacker can place or replace faces_raw.npz in the export directory (e.g., by supplying a specially crafted BeReal export archive), running python3 cluster_faces.py would execute arbitrary code on the operator's machine with the operator's privileges.
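To illustrate why the flag is needed with the current layout, here is a minimal sketch showing that NumPy's default loader (allow_pickle=False) refuses a dtype=object array. The file name faces_raw_demo.npz and the key embs are hypothetical stand-ins, not names taken from cluster_faces.py.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for faces_raw.npz: a jagged list stored as dtype=object.
jagged = np.empty(2, dtype=object)
jagged[0] = np.zeros(3, dtype=np.float32)
jagged[1] = np.zeros(5, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "faces_raw_demo.npz")
np.savez(path, embs=jagged)  # object arrays are serialized with pickle


def load_is_refused(npz_path):
    """Return True if the default (allow_pickle=False) load rejects the file."""
    try:
        with np.load(npz_path) as z:  # allow_pickle defaults to False
            z["embs"]  # members of an .npz are read when accessed
    except ValueError:
        return True
    return False


refused = load_is_refused(path)
print("default load refused object array:", refused)
```

The refusal is the safe default: passing allow_pickle=True is what opts the script into unpickling whatever the file contains.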

Suggested fix direction

Avoid dtype=object arrays and pickle altogether. Store embeddings as a flat float32 array with a separate integer offset/length array, and metadata as a JSON sidecar file. That lets np.load be called without allow_pickle.
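The layout described above could look roughly like the following. This is a sketch under stated assumptions, not the project's code: the function names save_faces and load_faces, the keys flat and offsets, and the JSON structure are all hypothetical, and each embedding is assumed to be a 1-D vector of floats.

```python
import json
import os
import tempfile

import numpy as np


def save_faces(npz_path, json_path, rels, embeddings):
    """Write jagged vectors as one flat float32 array plus offsets; metadata as JSON."""
    arrays = [np.asarray(e, dtype=np.float32).ravel() for e in embeddings]
    lengths = np.array([a.size for a in arrays], dtype=np.int64)
    # offsets[i]:offsets[i + 1] is the slice of `flat` holding vector i.
    offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int64)
    flat = np.concatenate(arrays) if arrays else np.zeros(0, dtype=np.float32)
    np.savez(npz_path, flat=flat, offsets=offsets)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"rels": list(rels)}, f)


def load_faces(npz_path, json_path):
    """Load without pickle and validate the offsets before slicing."""
    with np.load(npz_path, allow_pickle=False) as z:
        flat, offsets = z["flat"], z["offsets"]
    with open(json_path, encoding="utf-8") as f:
        rels = json.load(f)["rels"]
    # The file is untrusted input, so check the index array before using it.
    if (offsets.ndim != 1 or offsets.size == 0 or offsets[0] != 0
            or offsets[-1] != flat.size or np.any(np.diff(offsets) < 0)):
        raise ValueError("corrupt offsets array")
    if len(rels) != offsets.size - 1:
        raise ValueError("metadata and embeddings disagree on entry count")
    embeddings = [flat[offsets[i]:offsets[i + 1]] for i in range(offsets.size - 1)]
    return rels, embeddings


# Round-trip usage with made-up data.
tmp = tempfile.mkdtemp()
npz_file = os.path.join(tmp, "faces_raw.npz")
json_file = os.path.join(tmp, "faces_raw.json")
save_faces(npz_file, json_file, ["a.jpg", "b.jpg"], [[1.0, 2.0], [3.0, 4.0, 5.0]])
rels, embs = load_faces(npz_file, json_file)
```

Because both arrays have plain numeric dtypes, a tampered file can at worst produce wrong or rejected data rather than code execution through the pickle path. An existing faces_raw.npz written in the old format would need a one-time migration or regeneration.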

Severity

moderate

Found by

Automated audit by Claude Code

Reference
bc1bb/BeReal-extractor#1