FLUX DOCUMENTATION SYSTEM Layer 5 — INTELLIGENCE | metadata-enrichment flux.dantesisofo.com/wiki/metadata-enrichment/

METADATA ENRICHMENT

1. THE METADATA DATABASE

Every photograph in the FLUX corpus receives a full metadata record in a relational database.

The metadata database is the join layer between:
- the archive (files on NAS)
- the intelligence layer (embeddings, keeper scores, motif clusters)
- the publishing layer (issue assignments, catalog entries)

Initial implementation: SQLite
Upgrade path:           PostgreSQL (if corpus exceeds ~5M rows or concurrent writes required)
Location:               /FLUX_METADATA/flux.db (on NAS)
Backup:                 nightly snapshot to /FLUX_METADATA/backups/

SQLite is correct for Phase 1. The corpus is ~400,000 rows. SQLite handles this without issue. A single-file database on the NAS is simple, portable, and directly inspectable.
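The nightly snapshot listed above could be taken with the backup API in Python's standard sqlite3 module. This is a minimal sketch, not the live backup job: the function name and the dated file naming scheme are assumptions made for illustration.

```python
import sqlite3
from datetime import date
from pathlib import Path


def snapshot_database(db_path, backup_dir, today=None):
    """Copy a live SQLite database to a dated snapshot file and return its path."""
    today = today or date.today()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per calendar day.
    target_path = backup_dir / f"flux-{today.isoformat()}.db"

    source = sqlite3.connect(str(db_path))
    target = sqlite3.connect(str(target_path))
    try:
        # Connection.backup copies the database page by page and produces a
        # consistent snapshot even if another connection writes to the source.
        source.backup(target)
    finally:
        target.close()
        source.close()
    return target_path
```

Called as snapshot_database("/FLUX_METADATA/flux.db", "/FLUX_METADATA/backups/"), this would write one snapshot per day. Unlike a plain file copy, it should not capture a half-written transaction.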

2. FULL SCHEMA

CREATE TABLE photos (
  photo_id        TEXT PRIMARY KEY,
  filename        TEXT NOT NULL,
  sha256          TEXT NOT NULL UNIQUE,
  captured_at     TEXT,              -- ISO 8601 timestamp from EXIF
  camera_make     TEXT,
  camera_model    TEXT,
  focal_length_mm REAL,
  aperture        TEXT,
  shutter_speed   TEXT,
  iso             INTEGER,
  lat             REAL,
  lon             REAL,
  location_city   TEXT,              -- reverse geocoded
  location_country TEXT,
  weather         TEXT,              -- future: from API at capture time
  keeper          INTEGER DEFAULT 0, -- 0 or 1
  keeper_score    REAL,              -- 0.0–1.0 from model
  issue_id        TEXT,              -- FLUX_NNN or NULL
  catalog_id      TEXT,              -- CAT_NNN or NULL
  embedding_id    TEXT,              -- reference to embeddings table
  tags            TEXT,              -- JSON array
  created_at      TEXT DEFAULT (datetime('now')),
  updated_at      TEXT DEFAULT (datetime('now'))
);

CREATE INDEX idx_photos_captured_at   ON photos(captured_at);
-- No separate index on sha256: the UNIQUE constraint on photos.sha256 already creates one.
CREATE INDEX idx_photos_keeper        ON photos(keeper);
CREATE INDEX idx_photos_issue_id      ON photos(issue_id);
CREATE INDEX idx_photos_catalog_id    ON photos(catalog_id);
CREATE INDEX idx_photos_location      ON photos(location_city, location_country);

CREATE TABLE embeddings (
  embedding_id    TEXT PRIMARY KEY,
  photo_id        TEXT NOT NULL REFERENCES photos(photo_id),
  model           TEXT NOT NULL,     -- e.g., 'clip-vit-l-14'
  dimensions      INTEGER NOT NULL,  -- 512 or 768
  vector_path     TEXT,              -- path to .npy file or sqlite-vec rowid
  created_at      TEXT DEFAULT (datetime('now'))
);

CREATE TABLE issues (
  issue_id        TEXT PRIMARY KEY,  -- FLUX_NNN
  created_at      TEXT NOT NULL,
  frame_count     INTEGER NOT NULL CHECK (frame_count = 36),
  pdf_path        TEXT,
  s3_key          TEXT,
  manifest_path   TEXT
);

CREATE TABLE catalog_entries (
  catalog_id      TEXT PRIMARY KEY,  -- CAT_NNN
  photographer    TEXT NOT NULL,
  title           TEXT,
  submitted_at    TEXT NOT NULL,
  pdf_path        TEXT,
  s3_key          TEXT
);
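To show how the schema's constraints behave in practice, the sketch below loads the issues table into an in-memory database and attempts two inserts. The helper name try_insert_issue and the issue ID FLUX_020 are hypothetical and used only for illustration.

```python
import sqlite3

# Same DDL as the issues table above.
ISSUES_DDL = """
CREATE TABLE issues (
  issue_id        TEXT PRIMARY KEY,
  created_at      TEXT NOT NULL,
  frame_count     INTEGER NOT NULL CHECK (frame_count = 36),
  pdf_path        TEXT,
  s3_key          TEXT,
  manifest_path   TEXT
);
"""


def try_insert_issue(conn, issue_id, created_at, frame_count):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO issues (issue_id, created_at, frame_count) VALUES (?, ?, ?)",
                (issue_id, created_at, frame_count),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(ISSUES_DDL)
accepted = try_insert_issue(conn, "FLUX_019", "2026-05-15T14:22:00Z", 36)
rejected = try_insert_issue(conn, "FLUX_020", "2026-05-16T10:00:00Z", 35)  # hypothetical issue
```

The second insert fails the CHECK constraint, so only 36-frame issues can be recorded. Note that SQLite enforces the REFERENCES clause on embeddings.photo_id only when the connection has run PRAGMA foreign_keys = ON.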

3. ENRICHMENT PIPELINE

Per-image enrichment workflow. Runs once on ingest; individual steps can be re-run if a step fails.

1.  INGEST
    Receive JPEG in /FLUX_INBOX/
    Generate canonical filename
    Compute SHA-256 hash
    Check deduplication (SELECT 1 FROM photos WHERE sha256 = ?)
    If duplicate: log and skip

2.  EXIF EXTRACTION
    Extract all EXIF fields (see Section 4)
    Parse captured_at from DateTimeOriginal
    Extract GPS coordinates if present

3.  REVERSE GEOCODING
    If lat/lon present: reverse geocode → city + country (see Section 5)
    Write location_city, location_country

4.  WEATHER ENRICHMENT (future/proposed)
    If captured_at + lat/lon present: query Open-Meteo historical API (see Section 6)
    Write weather condition string

5.  EMBEDDING GENERATION
    Run CLIP inference → float vector
    Write to embeddings table
    Write embedding_id to photos table

6.  AUTO-TAGGING
    Run CLIP zero-shot classification against tag vocabulary (see Section 7)
    Write tags as JSON array

7.  KEEPER SCORING
    If keeper model is trained: run inference → float 0.0–1.0
    Write keeper_score
    If photo_id in keeper archive: set keeper = 1, keeper_score = 1.0

8.  DATABASE WRITE
    INSERT OR REPLACE INTO photos (...)
    Commit transaction

9.  THUMBNAIL GENERATION
    Generate 800px thumbnail
    Write to /FLUX_PUBLIC/thumbs/{photo_id}.jpg
    Sync to S3 thumbs/
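
Steps 1 and 8 of the workflow above can be sketched as follows. One substitution is made deliberately: INSERT OR REPLACE deletes the existing row before inserting the new one, which resets created_at to its default, so this sketch uses an ON CONFLICT ... DO UPDATE upsert instead (available in SQLite 3.24 and later). The function names are assumptions, and only a few columns are shown.

```python
import hashlib
import json
import sqlite3


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large JPEGs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(conn, sha256):
    """Step 1 deduplication check: True if the hash is already in photos."""
    row = conn.execute("SELECT 1 FROM photos WHERE sha256 = ?", (sha256,)).fetchone()
    return row is not None


def upsert_photo(conn, record):
    """Step 8 write. Remaining columns would follow the same pattern."""
    # tags is stored as a JSON array in a TEXT column.
    params = {**record, "tags": json.dumps(record.get("tags", []))}
    with conn:  # one transaction per photo
        conn.execute(
            """
            INSERT INTO photos (photo_id, filename, sha256, tags)
            VALUES (:photo_id, :filename, :sha256, :tags)
            ON CONFLICT(photo_id) DO UPDATE SET
                filename   = excluded.filename,
                sha256     = excluded.sha256,
                tags       = excluded.tags,
                updated_at = datetime('now')
            """,
            params,
        )
```

Re-running a step for an existing photo_id then updates the row in place and leaves created_at untouched.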

4. EXIF EXTRACTION

EXIF fields extracted per photograph:

Field              EXIF Tag              Notes
captured_at        DateTimeOriginal      Primary timestamp; fall back to DateTime
camera_make        Make                  e.g., "RICOH"
camera_model       Model                 e.g., "GR IIIx"
focal_length_mm    FocalLength           Stored as REAL (e.g., 26.1)
aperture           FNumber               Stored as string (e.g., "f/5.6")
shutter_speed      ExposureTime          Stored as string (e.g., "1/500")
iso                ISOSpeedRatings       Stored as INTEGER
lat                GPSLatitude           Decimal degrees, positive = N
lon                GPSLongitude          Decimal degrees, positive = E

Implementation:

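import exifread

def extract_exif(filepath):
    with open(filepath, 'rb') as f:
        # details=False skips MakerNote parsing. No stop_tag is passed: exifread
        # expects a bare tag name there (e.g. 'DateTimeOriginal'), and stopping
        # early risks dropping fields that are read below.
        tags = exifread.process_file(f, details=False)

    # Primary timestamp, falling back to the generic DateTime tag.
    # Values are exifread tag objects; convert with str() before storing.
    captured_at = tags.get('EXIF DateTimeOriginal') or tags.get('Image DateTime')
    lat         = _parse_gps(tags.get('GPS GPSLatitude'), tags.get('GPS GPSLatitudeRef'))
    lon         = _parse_gps(tags.get('GPS GPSLongitude'), tags.get('GPS GPSLongitudeRef'))
    # ... etc.
    return ExifRecord(captured_at=captured_at, lat=lat, lon=lon, ...)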

Camera model support: Ricoh GR III, GR IIIx. Both write the fields listed above as standard EXIF tags, so no camera-specific handling is required. Fields that a given file lacks are handled as described next.

Missing EXIF handling:
- captured_at missing: use file modification timestamp as fallback; log warning
- GPS missing: leave lat/lon NULL; skip geocoding and weather enrichment
- ISO missing: leave NULL; do not substitute a default
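
The _parse_gps helper referenced in extract_exif is not shown in this document. A minimal sketch of the conversion it would need to perform is given below, together with the EXIF-to-ISO timestamp conversion for captured_at. Both function names are assumptions. With exifread, the three coordinate parts would come from the tag's values list and may first need converting to Fraction, depending on the library version.

```python
from datetime import datetime
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees.

    Returns None when either part is missing so the caller can leave lat/lon NULL.
    """
    if not dms or not ref:
        return None
    degrees, minutes, seconds = (Fraction(part) for part in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative, matching "positive = N" and "positive = E".
    if str(ref).strip().upper() in ("S", "W"):
        value = -value
    return float(value)


def exif_datetime_to_iso(raw):
    """Turn the EXIF form 'YYYY:MM:DD HH:MM:SS' into ISO 8601, or None if unparseable."""
    try:
        return datetime.strptime(str(raw).strip(), "%Y:%m:%d %H:%M:%S").isoformat()
    except ValueError:
        return None
```

For example, 39 degrees 56 minutes 35.52 seconds N corresponds to a latitude of 39.9432, and "2023:01:15 13:22:44" becomes "2023-01-15T13:22:44".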

5. REVERSE GEOCODING

Convert GPS coordinates to human-readable location.

Method:   Nominatim (OpenStreetMap geocoding service)
Mode:     LOCAL — run Nominatim locally or use the public API with rate limiting
Privacy:  No third-party API calls for batch processing (400,000 coordinate lookups)
Cache:    SQLite cache table (lat/lon rounded to 4 decimal places → city/country)

Local Nominatim is required for batch processing. The public Nominatim API enforces a 1 request/second rate limit. At 400,000 photographs, full corpus geocoding via the public API would take ~111 hours and violates the usage policy.

Alternative: reverse geocode offline against a pre-downloaded GeoNames database, for example with the reverse_geocoder package (GeoPy itself only wraps online geocoding services). Lower accuracy, no rate limit, fully offline.

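from geopy.geocoders import Nominatim

# Assumed setup for a local Nominatim instance; domain, scheme and user_agent are placeholders
nominatim = Nominatim(user_agent="flux-metadata", domain="localhost:8080", scheme="http")

def reverse_geocode(lat, lon, cache_db):
    # Round coordinates to 4 decimal places (~11m precision)
    lat_r = round(lat, 4)
    lon_r = round(lon, 4)

    # Check cache
    cached = cache_db.get(lat_r, lon_r)
    if cached is not None:
        return cached

    # Query Nominatim; reverse() returns None when nothing is found
    result  = nominatim.reverse(f"{lat_r},{lon_r}", language='en')
    address = result.raw.get('address', {}) if result else {}
    city    = address.get('city') or address.get('town')
    country = address.get('country')

    cache_db.set(lat_r, lon_r, city, country)
    return city, country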
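
The cache_db object used by reverse_geocode is not defined in this document. One way to back its get/set interface with the SQLite cache table described above is sketched here. The class name, table name and column names are hypothetical and are not part of the FLUX schema.

```python
import sqlite3


class GeocodeCache:
    """SQLite-backed cache keyed on coordinates rounded to 4 decimal places."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # Hypothetical table layout: one row per rounded coordinate pair.
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS geocode_cache (
                lat_r   REAL NOT NULL,
                lon_r   REAL NOT NULL,
                city    TEXT,
                country TEXT,
                PRIMARY KEY (lat_r, lon_r)
            )
            """
        )

    def get(self, lat, lon):
        """Return (city, country) for a cached point, or None on a cache miss."""
        return self.conn.execute(
            "SELECT city, country FROM geocode_cache WHERE lat_r = ? AND lon_r = ?",
            (round(lat, 4), round(lon, 4)),
        ).fetchone()

    def set(self, lat, lon, city, country):
        """Store or overwrite the result for a rounded coordinate pair."""
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO geocode_cache VALUES (?, ?, ?, ?)",
                (round(lat, 4), round(lon, 4), city, country),
            )
```

Because frames shot within roughly 11 m of each other round to the same key, repeated captures at one location result in a single Nominatim lookup.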

6. WEATHER ENRICHMENT

Match capture timestamp and GPS coordinates to historical weather condition.

API:       Open-Meteo historical weather API (open source, no API key required)
Fields:    weathercode (WMO code → condition string), temperature_2m, precipitation
Precision: 1-hour resolution
Cache:     SQLite cache table (date + lat/lon rounded to 1 decimal → weather)

WMO weather code mapping (partial):

0          — clear sky
1, 2, 3    — mainly clear, partly cloudy, overcast
45, 48     — fog, rime fog
51, 53, 55 — drizzle (light, moderate, dense)
61, 63, 65 — rain (slight, moderate, heavy)
71, 73, 75 — snow (slight, moderate, heavy)
80, 81, 82 — rain showers
95         — thunderstorm

The stored weather field is a human-readable string: "fog", "rain", "clear", "overcast", "drizzle", "snow".

Weather enrichment is a FUTURE/PROPOSED step. It requires GPS coordinates and a captured timestamp. Photographs without GPS cannot be weather-enriched.
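
Since this step is still proposed, the following is only a sketch of how the code mapping and cache key described above might look. The function names are assumptions. So are the choices marked in the comments, such as storing shower codes as "rain", because the documented mapping is partial.

```python
# Condition strings follow the partial WMO mapping above. Entries marked
# "assumption" are illustrative choices, not documented behaviour.
WMO_CONDITIONS = {
    0: "clear",
    1: "clear",          # assumption: "mainly clear" stored as "clear"
    2: "partly cloudy",  # assumption: not in the documented list of stored strings
    3: "overcast",
    45: "fog", 48: "fog",
    51: "drizzle", 53: "drizzle", 55: "drizzle",
    61: "rain", 63: "rain", 65: "rain",
    71: "snow", 73: "snow", 75: "snow",
    80: "rain", 81: "rain", 82: "rain",  # assumption: showers stored as "rain"
    95: "thunderstorm",
}


def weather_condition(wmo_code):
    """Map a WMO weather code to a stored condition string, or None if unmapped."""
    if wmo_code is None:
        return None
    return WMO_CONDITIONS.get(int(wmo_code))


def weather_cache_key(captured_at, lat, lon):
    """Cache key: capture date plus coordinates rounded to 1 decimal place."""
    # captured_at is an ISO 8601 string; its first 10 characters are the date.
    return (captured_at[:10], round(lat, 1), round(lon, 1))
```

Returning None for unmapped codes keeps the weather column NULL rather than storing a guessed condition.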

7. AUTO-TAGGING

Subject matter tags generated using CLIP zero-shot classification.

Model:    CLIP (same model used for embeddings)
Method:   Compute text embeddings for vocabulary; find top-N closest to image embedding
Vocab:    curated tag vocabulary (see below)
Output:   JSON array of strings stored in photos.tags

Tag vocabulary (initial):

people, crowd, solitude, gesture, silhouette, shadow, reflection,
umbrella, rain, fog, snow, clear, night, dawn, dusk, golden hour,
street, alley, intersection, transit, platform, station,
building, facade, doorway, window, signage, construction,
motion blur, static, compression, layering, depth,
Philadelphia, Rome, neighborhood

Vocabulary is curated and extended over time. Tags are not exhaustive — they encode the vocabulary of the photographer's practice.

import numpy as np

def auto_tag(image_embedding, text_embeddings, vocab, top_k=5, threshold=0.25):
    # image_embedding: L2-normalized float vector
    # text_embeddings: dict {tag: L2-normalized vector}
    # vocab: iterable of tags to score; each must be a key of text_embeddings
    # With L2-normalized vectors the dot product equals cosine similarity
    scores = {tag: float(np.dot(image_embedding, text_embeddings[tag])) for tag in vocab}
    top    = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [tag for tag, score in top if score >= threshold]
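
To make the ranking and threshold logic concrete, here is a dependency-free worked example. It is a stand-in for auto_tag, not the production code: the vectors are made-up 3-dimensional values rather than CLIP embeddings, and the helper names are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def auto_tag_pure(image_embedding, text_embeddings, top_k=5, threshold=0.25):
    """Same ranking and threshold logic as auto_tag, without NumPy."""
    scores = {tag: dot(image_embedding, vec) for tag, vec in text_embeddings.items()}
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [tag for tag, score in top if score >= threshold]


# Toy vectors standing in for CLIP embeddings (made-up numbers).
image = l2_normalize([0.9, 0.4, 0.1])
text = {
    "fog":      l2_normalize([1.0, 0.3, 0.0]),
    "umbrella": l2_normalize([0.2, 1.0, 0.1]),
    "night":    l2_normalize([-1.0, 0.0, 0.5]),
}
tags = auto_tag_pure(image, text, top_k=3)
```

Here "fog" and "umbrella" score above the threshold and are returned in that order. "night" points away from the image vector and is dropped even though top_k would allow a third tag.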

8. MANIFEST GENERATION

Per-issue manifests are already generated by the live system. The metadata database formalizes this.

Manifest schema (per-issue manifest.json):

{
  "issue_id": "FLUX_019",
  "generated_at": "2026-05-15T14:22:00Z",
  "frame_count": 36,
  "frames": [
    {
      "position": 1,
      "filename": "2023-01-15_13-22-44_DanteSisofo_R0001234.JPG",
      "sha256": "a3f7b92d4e8c1f...",
      "captured_at": "2023-01-15T13:22:44",
      "lat": 39.9432,
      "lon": -75.1598,
      "location_city": "Philadelphia",
      "location_country": "United States"
    },
    ...
  ]
}

The manifest is generated by generate_issue_metadata.py and uploaded to:
- S3: FLUX_ISSUES/FLUX_019/manifest.json
- NAS: /FLUX_ISSUES/FLUX_019/manifest.json

The manifest page in the PDF (page 43) is generated from this JSON.
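
As an illustration of how a manifest with this shape could be assembled from photo rows, consider the sketch below. It is not the contents of generate_issue_metadata.py. The function name, the FRAME_FIELDS tuple and the assumption that rows arrive as dicts in frame order are all hypothetical.

```python
import json
from datetime import datetime, timezone

FRAMES_PER_ISSUE = 36  # matches CHECK (frame_count = 36) in the issues table
FRAME_FIELDS = ("filename", "sha256", "captured_at", "lat", "lon",
                "location_city", "location_country")


def build_manifest(issue_id, photo_rows, generated_at=None):
    """Assemble the manifest dict from photo rows already in frame order."""
    if len(photo_rows) != FRAMES_PER_ISSUE:
        raise ValueError(
            f"{issue_id}: expected {FRAMES_PER_ISSUE} frames, got {len(photo_rows)}"
        )
    if generated_at is None:
        generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    frames = [
        # Missing fields (e.g. no GPS) are emitted as null rather than omitted.
        {"position": position, **{field: row.get(field) for field in FRAME_FIELDS}}
        for position, row in enumerate(photo_rows, start=1)
    ]
    return {
        "issue_id": issue_id,
        "generated_at": generated_at,
        "frame_count": len(frames),
        "frames": frames,
    }
```

Serializing the result with json.dumps(manifest, indent=2) yields the layout shown above. Refusing to build a manifest with anything other than 36 frames mirrors the database constraint.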

9. EXAMPLE RECORD

Full metadata record for a single photograph:

{
  "photo_id":       "2023-01-15_13-22-44_DanteSisofo_R0001234",
  "filename":       "2023-01-15_13-22-44_DanteSisofo_R0001234.JPG",
  "sha256":         "a3f7b92d4e8c1f...",
  "captured_at":    "2023-01-15T13:22:44",
  "camera_make":    "RICOH",
  "camera_model":   "GR IIIx",
  "focal_length_mm": 26.1,
  "aperture":       "f/5.6",
  "shutter_speed":  "1/500",
  "iso":            400,
  "lat":            39.9432,
  "lon":            -75.1598,
  "location_city":  "Philadelphia",
  "location_country": "United States",
  "weather":        "fog",
  "keeper":         1,
  "keeper_score":   1.0,
  "issue_id":       "FLUX_019",
  "catalog_id":     null,
  "embedding_id":   "emb_a3f7b92d",
  "tags":           ["fog", "silhouette", "umbrella", "street", "Philadelphia"],
  "created_at":     "2026-05-20T09:00:00",
  "updated_at":     "2026-05-20T09:00:00"
}

Document        Layer                       Relationship
INTELLIGENCE    Layer 5 — Intelligence      Layer overview; metadata enrichment is a subdocument
EMBEDDINGS      Layer 5 — Intelligence      Embedding pipeline that feeds embedding_id into this schema
KEEPER MODEL    Layer 5 — Intelligence      Model that populates keeper_score field
ARCHIVE         Layer 2 — Protocol          Digital archive structure that metadata enriches
BOOTSTRAP       Layer 4 — Infrastructure    Phase 4 of the implementation plan is metadata layer initialization