Estimate the latent dimensionality of a triplet dataset — estimate

Fits embeddings across a range of dimensionalities, each with multiple random restarts, and returns the hold-out loss at every (dimension, restart) combination. Use the results to select the smallest dimensionality that achieves near-minimum hold-out loss.

Usage

estimate_dimensionality(
  triplet_list,
  dims = 1:8,
  n_restarts = 10L,
  max_epochs = 50000L,
  tolerance = 1e-04,
  tol_window = 10000L,
  device = NULL,
  seed = 1L,
  verbose = TRUE
)

Arguments

triplet_list: A named list of data frames, one per participant, as returned by get.combined. Each data frame must contain columns Center, Left, Right, and Answer, and should contain sampleSet; if sampleSet is missing, the fallback split described under Filtering applies.
dims: Integer vector of dimensionalities to evaluate. Default 1:8.
n_restarts: Number of independent random restarts per dimensionality. Default 10L. More restarts give a more reliable loss estimate but multiply compute time.
max_epochs: Maximum training epochs per restart. Default 50000L.
tolerance: Smallest change in loss that counts as a meaningful improvement for early stopping. Default 1e-4.
tol_window: Number of epochs without meaningful improvement (see tolerance) before early stopping triggers. Default 10000L.
device: PyTorch device string, or NULL (default) to auto-select: CUDA GPU if available, then Apple MPS, then CPU.
seed: Base integer seed for reproducibility. Each (d, restart) pair receives a unique derived seed so all runs are independently replicable. Default 1L.
verbose: Logical. If TRUE (default), print a progress line before each restart. Ignored when running in parallel (output from worker processes is not forwarded to the main session).

Value

A named list with two elements:

results: Data frame with one row per (dimension, restart) and columns d, restart, loss, epoch.
summary: Data frame with one row per dimension and columns d, mean_loss, min_loss, sd_loss, and best_d. The logical column best_d is TRUE for the smallest d whose mean loss is within one standard error of the global minimum mean loss.

Parallelism

By default the function runs serially. If the future.apply package is installed, parallelism is controlled by setting a future plan before calling this function. Each (d, restart) pair becomes an independent future, so any backend supported by future works: local multicore, SLURM, HTCondor, etc. See the "Computing Triplet Embeddings" vignette for worked examples.

Method

For each value of d in dims and each restart, an independent embedding is trained from a fresh random initialisation (controlled by a deterministic seed derived from seed, d, and the restart index). The best test loss achieved during training is recorded.

The summary element of the return value includes a best_d column that applies the one-standard-error rule to the per-dimension mean loss: the smallest d whose mean loss is within one standard error of the global minimum mean loss is flagged as best_d = TRUE. This tends to favour parsimony when several dimensions achieve similar loss.
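
The one-standard-error rule described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: it assumes the standard error is the per-dimension standard deviation divided by the square root of the number of restarts, and that the threshold uses the standard error of the minimum-loss dimension. The helper name flag_best_d and the toy data are hypothetical.

```r
# Hypothetical helper: apply the one-standard-error rule to a table with
# columns d and loss (one row per restart). Assumes SE = SD / sqrt(n).
flag_best_d <- function(results) {
  d_values  <- sort(unique(results$d))
  mean_loss <- as.numeric(tapply(results$loss, results$d, mean))
  sd_loss   <- as.numeric(tapply(results$loss, results$d, sd))
  n         <- as.numeric(tapply(results$loss, results$d, length))
  se_loss   <- sd_loss / sqrt(n)

  # Threshold: minimum mean loss plus the standard error at that minimum.
  i_min     <- which.min(mean_loss)
  threshold <- mean_loss[i_min] + se_loss[i_min]

  # Smallest d whose mean loss does not exceed the threshold.
  best <- min(d_values[mean_loss <= threshold])

  data.frame(d = d_values, mean_loss = mean_loss,
             sd_loss = sd_loss, best_d = d_values == best)
}

# Toy data (made up): the loss plateaus from d = 2 onwards.
toy <- data.frame(
  d    = rep(1:4, each = 3),
  loss = c(0.70, 0.72, 0.71,
           0.52, 0.50, 0.54,
           0.49, 0.51, 0.53,
           0.50, 0.53, 0.51)
)
toy_summary <- flag_best_d(toy)
toy_summary$d[toy_summary$best_d]
```

In this toy data d = 3 has the lowest mean loss, but d = 2 falls within one standard error of it, so the rule flags the smaller dimension.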

Item indexing

All unique item names in Center, Left, and Right across all participants are collected and sorted alphabetically; this sorted order defines the zero-based integer indices passed to the Python model.
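
A minimal sketch of this indexing scheme in base R follows. The toy participant data and the variable names are hypothetical, and sort() uses the session locale's collation, which may differ from the ordering the package applies internally.

```r
# Toy participant data (item names are made up for illustration).
toy_list <- list(
  p1 = data.frame(Center = c("dog", "cat"),
                  Left   = c("cat", "ant"),
                  Right  = c("ant", "dog")),
  p2 = data.frame(Center = "bee", Left = "dog", Right = "cat")
)

# Collect every item name that appears in any of the three columns.
all_items <- unlist(lapply(toy_list, function(df) {
  c(as.character(df$Center), as.character(df$Left), as.character(df$Right))
}), use.names = FALSE)

# Sorted unique names define the index; subtract 1 for zero-based indices.
items      <- sort(unique(all_items))
item_index <- setNames(seq_along(items) - 1L, items)
item_index
```

Here "ant" maps to 0 and "dog" to 3, so the index of an item depends on the full set of item names across all participants, not on the order in which trials appear.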

Filtering

Trials with NA in the sampleSet column (attention-check trials) are excluded before fitting. The remaining trials are split by their sampleSet value ("train" / "test"), and the test split supplies the hold-out loss used for early stopping. If there is no sampleSet column, or all of its values are NA, a random 70/30 train/test split is used instead.
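
The filtering and fallback logic can be sketched as follows. The helper split_trials and its arguments are hypothetical, and the sketch assumes the split is applied to a single data frame; whether the package splits per participant or on pooled data is not stated here.

```r
# Hypothetical helper: split one data frame of trials into train and test.
split_trials <- function(df, test_frac = 0.3, seed = 1L) {
  has_labels <- "sampleSet" %in% names(df) && !all(is.na(df$sampleSet))
  if (has_labels) {
    # Drop attention-check trials (NA), then split on the labels.
    df      <- df[!is.na(df$sampleSet), , drop = FALSE]
    is_test <- df$sampleSet == "test"
  } else {
    # Fallback: random 70/30 split (note that set.seed changes global RNG state).
    set.seed(seed)
    n_test  <- round(test_frac * nrow(df))
    is_test <- seq_len(nrow(df)) %in% sample.int(nrow(df), n_test)
  }
  list(train = df[!is_test, , drop = FALSE],
       test  = df[is_test, , drop = FALSE])
}

# Labelled trials: the NA row is treated as an attention check and dropped.
labelled <- data.frame(
  Center    = letters[1:5],
  sampleSet = c("train", "test", NA, "train", "test")
)
s1 <- split_trials(labelled)

# No sampleSet column: falls back to the random split.
unlabelled <- data.frame(Center = letters[1:10])
s2 <- split_trials(unlabelled)

c(nrow(s1$train), nrow(s1$test), nrow(s2$train), nrow(s2$test))
```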

Examples

if (FALSE) { # \dontrun{
# Serial (default)
dim_est <- estimate_dimensionality(
  triplet_list = icon_triplets,
  dims         = 1:6,
  n_restarts   = 5L,
  max_epochs   = 20000L,
  seed         = 42L
)

# Parallel: use 4 local cores (requires future.apply)
library(future)
plan(multisession, workers = 4)
dim_est <- estimate_dimensionality(icon_triplets, dims = 1:6, n_restarts = 10L)
plan(sequential)  # restore serial execution afterwards

# Summary with recommended dimensionality flagged
dim_est$summary

# Plot mean loss +/- 1 SD by dimension
s <- dim_est$summary
plot(s$d, s$mean_loss, type = "b", pch = 19,
     xlab = "Dimensions", ylab = "Mean test loss")
arrows(s$d, s$mean_loss - s$sd_loss, s$d, s$mean_loss + s$sd_loss,
       angle = 90, code = 3, length = 0.05)
abline(v = s$d[s$best_d], lty = 2)
} # }