FLUX DOCUMENTATION SYSTEM Layer 5 — INTELLIGENCE | training-data flux.dantesisofo.com/wiki/training-data/

TRAINING DATA

1. THE DATASET

The FLUX keeper archive is a labeled binary classification dataset.

Total corpus:         ~400,000 photographs
Positive class:       ~15,000 (keeper archive)
Negative class:       ~385,000 (full corpus, non-keeper)
Positive rate:        3.75%
Class imbalance:      ~26:1 (negative:positive)
Label source:         photographer selection over years of practice
Label quality:        high — each keeper was reviewed and consciously selected
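
The positive rate and imbalance ratio follow directly from the approximate counts above. The sketch below reproduces the arithmetic; `pos_weight` is an assumed, commonly used loss weighting and is not something this document specifies.

```python
# Approximate counts taken from the table above.
total = 400_000
positives = 15_000
negatives = total - positives

positive_rate = positives / total   # fraction of the corpus that is keepers
imbalance = negatives / positives   # negatives per positive

# Assumption: weight the positive class by the imbalance ratio so both
# classes contribute roughly equally to a weighted loss.
pos_weight = imbalance

print(f"positive rate: {positive_rate:.2%}")
print(f"imbalance:     {imbalance:.1f}:1")
```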

This is not a generic photography dataset. There are no crowd-sourced labels. No MTurk workers. No aggregate aesthetic scores. The labels encode one photographer's demonstrated selection behavior, applied consistently across thousands of sessions over years.

That is exactly what makes it valuable as training data for a personal taste model.

2. WHY THIS IS VALUABLE

Generic photography datasets (AVA, AADB, PCCD) aggregate aesthetic preferences across many judges. They produce models that predict average aesthetic quality — what people generally find appealing.

The FLUX keeper dataset contains no such aggregation. It is a single-photographer selection record.

What it encodes:
- The specific visual vocabulary of the photographer's practice
- The selection threshold in real conditions (not studio ratings)
- Preference evolution over time (what was kept in 2022 vs. 2026 may differ)
- The specific equipment response (Ricoh GR III/IIIx rendering characteristics)
- The specific locations and subjects of this photographer's practice

A model trained on this dataset predicts not what is aesthetically good, but what this specific photographer probably keeps. For the purpose of the FLUX archive, that is the correct objective function.

3. KEEPER ARCHIVE STRUCTURE

/FLUX_ARCHIVE/KEEPERS/
    /2022/
        /2022-01/
            /2022-01-15/
                2022-01-15_13-22-44_DanteSisofo_R0001234.JPG
                2022-01-15_14-05-11_DanteSisofo_R0001289.JPG
        /2022-02/
            ...
    /2023/
        ...
    /2024/
        ...
    /2025/
        ...
    /2026/
        ...

Approximately 15,000 files. Organized by year, month, and day. Filenames follow the canonical FLUX format.
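
A minimal sketch of parsing a keeper filename and deriving its folder. The pattern is inferred from the two example filenames above and may not cover every variation of the canonical FLUX format; `parse_canonical_name` and `expected_folder` are hypothetical helper names.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Pattern inferred from the example filenames; treat it as an assumption.
CANONICAL_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})_"
    r"(?P<author>[^_]+)_(?P<original>.+)$"
)

def parse_canonical_name(filename):
    """Split a canonical filename into capture time, author, and original name."""
    m = CANONICAL_RE.match(filename)
    if m is None:
        return None
    try:
        captured_at = datetime.strptime(
            f"{m['date']}_{m['time']}", "%Y-%m-%d_%H-%M-%S"
        )
    except ValueError:
        return None  # digits present but not a valid date or time
    return {
        "captured_at": captured_at,
        "author": m["author"],
        "original": m["original"],
    }

def expected_folder(root, captured_at):
    """Build the year/month/day folder for a given capture time."""
    return (
        PurePosixPath(root)
        / f"{captured_at:%Y}"
        / f"{captured_at:%Y-%m}"
        / f"{captured_at:%Y-%m-%d}"
    )
```

The `original` field is the embedded Ricoh filename, so the same parser can back a filename-based lookup.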

The full corpus lives at:

/FLUX_ARCHIVE/ORIGINALS/
    /2022/
        /2022-01/
            /2022-01-15/
                [all ~200 photographs from that day's session]
    ...

Approximately 400,000 files. Same chronological folder structure. Some days have 10 photographs. Some days have 500.

4. MATCHING STRATEGY

The keeper archive and the full corpus are separate folder trees. Many files in the keeper archive are copies (or edited versions) of files in the full corpus. The match between them must be established explicitly.

Matching is applied in priority order. Stop at first match.

Step 1: EXIF timestamp match (primary)

def match_by_exif(keeper_file, corpus_index):
    """
    corpus_index: dict {DateTimeOriginal: [photo_ids]}
    """
    dt = get_exif_datetime(keeper_file)
    if dt and dt in corpus_index:
        candidates = corpus_index[dt]
        if len(candidates) == 1:
            return candidates[0]  # unique timestamp match
        # Multiple candidates at same timestamp: proceed to secondary
    return None
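
The timestamp index used above can be sketched as follows. EXIF DateTimeOriginal has one-second resolution, so frames shot in quick succession can share a key; that is why the index maps to a list. `build_timestamp_index` is a hypothetical helper, and the `(photo_id, captured_at)` row shape is an assumption.

```python
from collections import defaultdict

def build_timestamp_index(rows):
    """
    rows: iterable of (photo_id, captured_at) pairs.
    Returns dict {captured_at: [photo_ids]}; rows without a timestamp are skipped.
    """
    index = defaultdict(list)
    for photo_id, captured_at in rows:
        if captured_at is not None:
            index[captured_at].append(photo_id)
    return dict(index)
```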

Step 2: Original filename match (secondary)

def match_by_filename(keeper_file, corpus_index):
    """
    Match by embedded Ricoh filename (R0001234.JPG component of canonical name).
    corpus_index: dict {original_filename: photo_id}
    """
    original = extract_original_filename(keeper_file)  # "R0001234.JPG"
    return corpus_index.get(original)

Step 3: SHA-256 hash match (tertiary)

def match_by_hash(keeper_file, corpus_hash_index):
    """
    corpus_hash_index: dict {sha256: photo_id}
    """
    h = sha256_file(keeper_file)
    return corpus_hash_index.get(h)
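
One possible implementation of the `sha256_file` helper, reading in chunks so large files never need to fit in memory. The chunk size is an arbitrary assumption.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunk_size pieces."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```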

Step 4: Image similarity fallback

def match_by_similarity(keeper_file, corpus_embeddings, threshold=0.98):
    """
    Compute embedding of keeper file; find nearest neighbor in corpus.
    Use only if cosine similarity >= threshold.
    """
    keeper_vec = embed(keeper_file)
    result = corpus_embeddings.search(keeper_vec, k=1)
    # Guard against an empty result before reading the top hit.
    if result and result[0].score >= threshold:
        return result[0].photo_id
    return None  # requires human review
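
For illustration, a brute-force stand-in for `corpus_embeddings` that exposes the same `search(vec, k)` interface. `BruteForceIndex` and `SearchResult` are hypothetical names; at 400,000 vectors a real deployment would likely use an approximate nearest-neighbor index instead of this linear scan.

```python
import math
from collections import namedtuple

SearchResult = namedtuple("SearchResult", ["photo_id", "score"])

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

class BruteForceIndex:
    """Linear-scan index over {photo_id: vector}; O(n) per query."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, query, k=1):
        scored = [
            SearchResult(photo_id, cosine_similarity(query, vec))
            for photo_id, vec in self.vectors.items()
        ]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:k]  # empty list when the index is empty
```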

Unmatched keepers: Flag for human review. Do not assign keeper label to an unknown corpus entry. Log the unmatched keeper filename.
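
The priority cascade can be sketched as a loop over an ordered list of matchers. `match_keeper` and the `(name, callable)` pairing are assumptions made for illustration; each callable takes a keeper file and returns a photo_id or None.

```python
def match_keeper(keeper_file, matchers, unmatched_log):
    """
    Try each matcher in priority order and stop at the first match.
    Returns (photo_id, matcher_name), or (None, None) if nothing matched.
    """
    for name, matcher in matchers:
        photo_id = matcher(keeper_file)
        if photo_id is not None:  # a photo_id of 0 still counts as a match
            return photo_id, name
    # No label is written for unmatched keepers; they go to human review.
    unmatched_log.append(keeper_file)
    return None, None
```

Each step's function can be bound to its index before being placed in the list, for example `functools.partial(match_by_exif, corpus_index=timestamp_index)`. Recording the matcher name alongside the photo_id makes it possible to audit how many keepers were matched by each step.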

5. NEAR-MISSES

Near-misses are the most valuable training signal in the dataset.

A near-miss is a photograph that was reviewed during the selection process and rejected — not because it was a technical failure, but because it was close but not quite right. It is a photograph that almost made it.

Why near-misses are valuable:
- They define the selection threshold precisely. A clear reject does not tell you where the boundary is. A near-miss does.
- They encode the most nuanced aspects of the photographer's taste: what quality, exactly, separates keep from reject?
- They are more informative to the model than clear rejects.

The FLUX corpus provides implicit near-misses: any photograph from a session that also produced keepers is a candidate near-miss. If a session of 200 frames produced 8 keepers, the 192 non-keepers include some that came close.
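
A minimal sketch of collecting implicit near-miss candidates, assuming a session is identified by its capture date (as in the folder structure) and that labels are already binary. `near_miss_candidates` and the input tuple shape are hypothetical.

```python
from collections import defaultdict

def near_miss_candidates(photos):
    """
    photos: iterable of (photo_id, session_date, keeper) with keeper in {0, 1}.
    Returns {session_date: [non-keeper photo_ids]} for sessions that
    produced at least one keeper.
    """
    keeper_counts = defaultdict(int)
    rejects = defaultdict(list)
    for photo_id, session_date, keeper in photos:
        if keeper:
            keeper_counts[session_date] += 1
        else:
            rejects[session_date].append(photo_id)
    return {
        day: ids
        for day, ids in rejects.items()
        if keeper_counts.get(day, 0) > 0
    }
```

This yields candidates only. It does not decide which of them actually came close; that still requires explicit labeling.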

Explicit near-miss labeling is a FUTURE/PROPOSED enhancement. Phase 1 uses binary labels: keeper (1) or not (0). Near-miss labels (0.5) can be added as a third class in a future iteration.

6. DATA QUALITY CONSIDERATIONS

EXIF may be missing or wrong: Some files may have had EXIF stripped by editing software. Some cameras write incorrect timestamps (clock drift, timezone errors). Handle:
- Missing EXIF: fall through to filename/hash/similarity matching
- Clock drift: detect systematic offset (if a session's timestamps are all 1 hour early, correct for timezone)
- EXIF overwrite by editing: SHA-256 will differ from original; image similarity fallback is the correct path
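
One way to detect a systematic offset, sketched under the assumption that some keepers in the session have already been paired with corpus entries by another method (for example, filename). `systematic_offset`, the tolerance, and the minimum pair count are all hypothetical choices.

```python
from datetime import datetime, timedelta
from statistics import median

def systematic_offset(pairs, tolerance=timedelta(seconds=2), min_pairs=3):
    """
    pairs: list of (keeper_dt, corpus_dt) for already-paired photographs.
    Returns the median offset (keeper minus corpus) if every pair agrees
    with it within tolerance; otherwise None.
    """
    if len(pairs) < min_pairs:
        return None  # too little evidence to call the offset systematic
    offsets = [(keeper - corpus).total_seconds() for keeper, corpus in pairs]
    mid = median(offsets)
    if all(abs(o - mid) <= tolerance.total_seconds() for o in offsets):
        return timedelta(seconds=mid)
    return None
```

If an offset is returned, subtracting it from the remaining keeper timestamps in that session before the Step 1 lookup aligns them with the corpus timestamps.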

Duplicate files: The full corpus may contain duplicate files from multiple import operations. SHA-256 deduplication at ingest removes true (byte-identical) duplicates. Near-duplicates (same scene, different exposure) are distinct photographs and are kept.

Re-edited keepers: Some keepers in the keeper archive may be lightly edited versions of the originals (cropped, exposure-adjusted). The SHA-256 will not match the corpus original. Use image similarity fallback. Store the original's photo_id as the canonical reference; note that the keeper version is edited.

Missing keepers: Some keepers may exist in the keeper archive but not in the current corpus (lost in a previous migration, never imported). These are unrecoverable without the original media card. Log them. Do not fabricate corpus entries.

7. DATASET CONSTRUCTION

Steps to build the labeled dataset from existing archives:

1. Import full corpus to NAS (/FLUX_ARCHIVE/ORIGINALS/)
   — ~400,000 Ricoh JPEGs from SanDisk SSD
   — Deduplicate by SHA-256 during import

2. Build corpus index
   — SQLite table: photo_id, sha256, captured_at, original_filename
   — Index the three lookup fields (sha256, captured_at, original_filename) for fast lookup

3. Import keeper archive to NAS (/FLUX_ARCHIVE/KEEPERS/)
   — ~15,000 selected JPEGs from existing keeper folder structure

4. Run matching pipeline (Section 4)
   — For each keeper file: attempt match in priority order
   — Write keeper=1, keeper_score=1.0 to matched corpus entry
   — Log unmatched keepers for manual review

5. Verify match quality
   — Count: expected ~15,000 matches
   — Inspect unmatched keepers: typically <5% of total
   — Manually resolve ambiguous matches

6. Generate embeddings for full corpus (see: EMBEDDINGS)
   — Required before any model training

7. Build training set
   — SELECT photo_id, embedding_id, keeper FROM photos
   — Split: 80% train, 10% validation, 10% test
   — Stratify split to preserve keeper rate in each split

8. Train keeper model (see: KEEPER MODEL)
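
Steps 2, 4, and 7 can be sketched with the standard library alone. The table name `photos` and the columns `keeper` and `keeper_score` come from the steps above; the exact column types, the index names, and the helper functions are assumptions, and the `embedding_id` column is omitted here.

```python
import random
import sqlite3
from collections import defaultdict

SCHEMA = """
CREATE TABLE IF NOT EXISTS photos (
    photo_id          INTEGER PRIMARY KEY,
    sha256            TEXT NOT NULL UNIQUE,  -- UNIQUE also creates an index
    captured_at       TEXT,
    original_filename TEXT,
    keeper            INTEGER NOT NULL DEFAULT 0,
    keeper_score      REAL
);
CREATE INDEX IF NOT EXISTS idx_photos_captured_at
    ON photos (captured_at);
CREATE INDEX IF NOT EXISTS idx_photos_original_filename
    ON photos (original_filename);
"""

def add_photo(conn, sha256, captured_at, original_filename):
    """Insert a corpus row; a repeated sha256 is silently skipped (dedup)."""
    conn.execute(
        "INSERT OR IGNORE INTO photos (sha256, captured_at, original_filename) "
        "VALUES (?, ?, ?)",
        (sha256, captured_at, original_filename),
    )

def mark_keeper(conn, photo_id):
    """Write the keeper label to a matched corpus entry (step 4)."""
    conn.execute(
        "UPDATE photos SET keeper = 1, keeper_score = 1.0 WHERE photo_id = ?",
        (photo_id,),
    )

def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """
    rows: list of (photo_id, keeper). Each class is shuffled and split
    separately, so the keeper rate is preserved in train, validation, and test.
    """
    by_label = defaultdict(list)
    for photo_id, keeper in rows:
        by_label[keeper].append(photo_id)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_train = int(len(ids) * fractions[0])
        n_val = int(len(ids) * fractions[1])
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

A purely random split ignores session structure: frames from the same day can land in both train and test. Whether to split by session instead is a design decision this sketch does not make.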

8. AUGMENTATION

Standard data augmentation applies for keeper model training.

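# Assumption: these are torchvision transforms; the class names match torchvision.transforms.
from torchvision.transforms import ColorJitter, RandomCrop, RandomHorizontalFlip

augment_transforms = [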
    RandomHorizontalFlip(p=0.5),
    RandomCrop(size=(200, 200)),      # from 224x224 input
    ColorJitter(brightness=0.1, contrast=0.1, saturation=0.0, hue=0.0),
]

Restrictions:
- No aggressive color shift: the photographer's color rendering preference is part of the aesthetic being learned. Distorting saturation or hue would corrupt this signal.
- No rotation beyond ±5°: the photographer's horizon alignment is a compositional choice. Rotating images would corrupt the composition signal.
- No vertical flip: the orientation of subjects in the frame is meaningful.
- Horizontal flip is acceptable: left/right composition is often symmetric in the photographer's practice.
- Brightness/contrast variation is acceptable within narrow bounds: it simulates exposure variation without changing the fundamental aesthetic.

Augmentation is applied to the training set only. Validation and test sets are unaugmented.

Document	Layer	Relationship
INTELLIGENCE	Layer 5 — Intelligence	Layer overview; training data is a subdocument
KEEPER MODEL	Layer 5 — Intelligence	The model trained on this dataset
EMBEDDINGS	Layer 5 — Intelligence	Provides features for the training examples
METADATA ENRICHMENT	Layer 5 — Intelligence	Database schema that stores keeper labels
BOOTSTRAP	Layer 4 — Infrastructure	Phase 3 is keeper matching; Phase 6 is model training