Computing Triplet Embeddings

Overview

This vignette explains how to compute similarity embeddings from triplet judgment data collected with this package. Embeddings are computed by a Python backend that uses PyTorch to optimize the crowd kernel (CKL) noise model. From R you interact with this backend through three functions:

Function	What it does
`setup_python_env()`	One-time installation of the Python environment
`estimate_dimensionality()`	Fits embeddings across a range of dimensions to help choose `d`
`run_group_embedding_from_list()`	Trains a single embedding on all participants’ data combined
`run_embeddings_from_list()`	Trains one embedding per participant plus a group embedding

All functions accept data in the named-list format used throughout tripletTools (e.g. icon_triplets).

library(tripletTools)

Step 1: One-time Python environment setup

The embedding pipeline runs inside a self-contained conda environment. You only need to do this once after installing the package. It installs PyTorch, NumPy, pandas, scikit-learn, scipy, and skorch into an environment called "triplet-embeddings".

setup_python_env()

PyTorch is a large download (300–800 MB depending on your platform), so the first run may take several minutes. On future R sessions you do not need to call setup_python_env() again — the environment is detected and activated automatically when you load the package.

If conda is not found on your system, setup_python_env() will stop with a message directing you to the Miniconda installation page.

Step 2: Computing a group embedding

run_group_embedding_from_list() pools the judgments from all participants and trains a single embedding that captures the shared similarity structure across the group. This is the fastest option and is a natural starting point before computing individual embeddings.

grp <- run_group_embedding_from_list(
  triplet_list = icon_triplets,
  d            = 3L,       # number of dimensions
  max_epochs   = 50000L
)

The function prints training progress every 100 epochs, showing train loss, test loss, train accuracy, and test accuracy. Training stops early if the test loss has not improved meaningfully for 10,000 consecutive epochs.

What is returned

run_group_embedding_from_list() returns a named list:

str(grp)
#> List of 3
#>  $ embedding: num [1:32, 1:3] ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:32] "bdbwb" "bdbwb2" ...   # item names as row names
#>   .. ..$ : chr [1:3] "dim_0" "dim_1" "dim_2"
#>  $ loss     : num 0.412
#>  $ history  :'data.frame': 847 obs. of 5 variables:
#>   ..$ epoch     : int [1:847] 0 1 2 ...
#>   ..$ train_loss: num [1:847] ...
#>   ..$ test_loss : num [1:847] ...
#>   ..$ train_acc : num [1:847] ...
#>   ..$ test_acc  : num [1:847] ...

embedding — a matrix with one row per stimulus (item names as row names) and d columns (dim_0, dim_1, …).
loss — the best test loss achieved during training.
history — a data frame recording train/test loss and accuracy at each epoch, useful for diagnosing convergence.

Inspecting the result

# First few rows of the embedding
head(grp$embedding)

# Best test loss
cat("Best test loss:", round(grp$loss, 3), "\n")

# Did training converge? Plot the test loss curve.
plot(grp$history$epoch, grp$history$test_loss,
     type = "l", col = "steelblue",
     xlab = "Epoch", ylab = "Test loss",
     main = "Group embedding – training curve")

Visualizing the embedding

Once you have the group embedding, you can use plot_pics() to position stimulus images at their coordinates in the learned similarity space (first two dimensions shown):

plot_pics(grp$embedding[, 1:2], icon_pics,
          psize = 0.04,
          xlab  = "Dimension 1",
          ylab  = "Dimension 2",
          main  = "Group embedding")

Items that appear close together were frequently judged as similar by participants overall.

The group embedding can also be used with the full set of tripletTools analysis functions. For example, to evaluate how well it predicts held-out judgments for participant 1:

acc <- get.hoacc(grp$embedding, icon_triplets[[1]])
cat("Hold-out accuracy (participant 1):", round(acc, 3), "\n")

Step 3: Computing individual embeddings

run_embeddings_from_list() trains a separate embedding for each participant and then trains an additional group embedding on all participants’ data combined. This takes longer but lets you study individual differences in how participants represent the stimulus set.

results <- run_embeddings_from_list(
  triplet_list = icon_triplets,
  output_dir   = "embeddings_output",   # CSVs are also saved here
  d            = 3L,
  max_epochs   = 50000L
)

Output CSV files are written to output_dir as a side-effect:

File	Contents
`embeddings.csv`	All per-participant and group embeddings concatenated
`embeddings_group.csv`	Group embedding only
`model_history.csv`	Training diagnostics for each participant and the group

What is returned

run_embeddings_from_list() returns a named list:

str(results, max.level = 2)
#> List of 3
#>  $ individual:List of 6
#>   ..$ 3n7ggxph: num [1:32, 1:3] ...   # one matrix per participant
#>   ..$ b5wma4no: num [1:32, 1:3] ...
#>   ..$ d8mmm1qn: num [1:32, 1:3] ...
#>   ..$ jn7bbjc0: num [1:32, 1:3] ...
#>   ..$ pbby694o: num [1:32, 1:3] ...
#>   ..$ sc2xbd6w: num [1:32, 1:3] ...
#>  $ group     : num [1:32, 1:3] ...    # group embedding matrix
#>  $ history   :'data.frame': 7 obs. of 6 variables:
#>   ..$ worker_id              : chr [1:7] "3n7ggxph" ... "group"
#>   ..$ lowest_loss            : num [1:7] ...
#>   ..$ epoch                  : int [1:7] ...
#>   ..$ counter_from_last_update: int [1:7] ...
#>   ..$ n_train_triplets       : int [1:7] ...
#>   ..$ n_test_triplets        : int [1:7] ...

individual — a named list of matrices, one per participant, in the same format as grp$embedding above. Element names match the worker_id labels in your data.
group — a single matrix containing the group embedding.
history — a data frame with one row per participant plus one row for the group, summarising training diagnostics.

Inspecting training quality

The history data frame gives a quick overview of how well each participant’s model converged:

results$history[, c("worker_id", "lowest_loss", "epoch", "n_train_triplets")]

Participants who stopped early (small epoch) and achieved a low lowest_loss are well-modelled. If a participant’s lowest_loss is much higher than the others, their data may be noisier or they may have fewer trials.

Working with individual embeddings

The individual embeddings are stored in results$individual and have the same structure as the group embedding: a matrix with item row names and columns dim_0, dim_1, …. This means they are compatible with all existing tripletTools functions.

# Hold-out prediction accuracy for each participant using their own embedding
sapply(names(results$individual), function(wid) {
  get.hoacc(results$individual[[wid]], icon_triplets[[wid]])
})

# Prediction matrix: how well does each participant's embedding predict
# every other participant's held-out judgments?
pmat <- get.prediction.matrix(results$individual, icon_triplets)

# Do participants predict their own data better than others predict it?
cat("Mean self-prediction accuracy:  ", round(mean(diag(pmat)), 3), "\n")
cat("Mean other-prediction accuracy: ", round(mean(pmat[row(pmat) != col(pmat)]), 3), "\n")

Visualizing individual embeddings

emb1 <- results$individual[[1]]

plot_pics(emb1[, 1:2], icon_pics,
          psize = 0.04,
          xlab  = "Dimension 1",
          ylab  = "Dimension 2",
          main  = paste("Individual embedding –", names(results$individual)[1]))

Choosing between the two functions

	`run_group_embedding_from_list()`	`run_embeddings_from_list()`
Speed	Fast — one model	Slow — one model per participant + group
Output	Single group embedding	Per-participant + group embeddings
Best for	Quick summary of shared structure; pilot analyses	Studying individual differences; full analysis
Saved files	None	Three CSVs in `output_dir`

If you only need the group embedding, prefer run_group_embedding_from_list(). If you want to study whether participants differ in their representations — using tools like get.prediction.matrix(), get.rep.dist(), or pacc.by.cluster() — run run_embeddings_from_list() to obtain individual embeddings as well.

Controlling training

Both functions accept the same set of hyperparameters:

Parameter	Default	Meaning
`d`	`5`	Number of embedding dimensions
`max_epochs`	`50000`	Maximum training epochs
`tolerance`	`1e-4`	Minimum loss improvement to reset the early-stopping counter
`tol_window`	`10000`	Epochs without improvement before stopping
`seed`	`222`	Random seed for reproducibility

For exploratory analyses, the defaults are a reasonable starting point. For a final analysis you may want to increase max_epochs and tol_window to allow longer convergence, especially with large datasets or high d.

To inspect the training curve after fitting:

plot(grp$history$epoch, grp$history$test_loss,
     type = "l", col = "steelblue", lwd = 2,
     xlab = "Epoch", ylab = "Test loss",
     main = "Convergence check")
lines(grp$history$epoch, grp$history$train_loss,
      col = "tomato", lwd = 2, lty = 2)
legend("topright",
       legend = c("Test", "Train"),
       col    = c("steelblue", "tomato"),
       lty    = c(1, 2), lwd = 2)

A well-converged model shows test loss flattening well before max_epochs. If the loss is still falling at the end, increase max_epochs.

Choosing the number of dimensions

The number of embedding dimensions d is a hyperparameter that must be set before fitting. Too few dimensions underfit the similarity structure; too many overfit noise. The right approach is to treat d as an unknown, fit embeddings at several values, and choose the smallest d whose hold-out loss is not meaningfully higher than the best.

Because the optimizer is stochastic, hold-out loss varies across runs at the same d. A single fit per dimensionality is not reliable — a good run at d = 3 might beat a bad run at d = 4 by chance. estimate_dimensionality() addresses this by fitting multiple independent random restarts at each d and recording the best hold-out loss from each restart.

Running the grid

dim_est <- estimate_dimensionality(
  triplet_list = icon_triplets,
  dims         = 1:8,       # dimensionalities to evaluate
  n_restarts   = 10L,       # independent random restarts per d
  max_epochs   = 50000L,
  seed         = 42L
)

Progress messages are printed as each restart completes. The function returns a named list with two elements.

$results has one row per (dimension, restart):

head(dim_est$results)
#>   d restart      loss epoch
#> 1 1       1 0.6022...   ...
#> 2 1       2 0.6019...   ...
#> ...

$summary has one row per dimension, with the best_d flag identifying the recommended choice:

dim_est$summary
#>   d mean_loss  min_loss    sd_loss best_d
#> 1 1    0.601     0.601     0.001   FALSE
#> 2 2    0.512     0.510     0.003   FALSE
#> 3 3    0.448     0.445     0.004    TRUE   # <-- recommended
#> 4 4    0.447     0.443     0.005   FALSE
#> ...

How `best_d` is determined

The recommendation uses the one-standard-error rule: find the dimensionality with the lowest mean hold-out loss, compute one standard error of that mean (i.e. sd_loss / sqrt(n_restarts)), and then flag the smallest d whose mean loss falls within that margin. This favours parsimony — it selects the simplest model that is statistically indistinguishable from the best.

Visualizing the results

A simple plot shows mean loss by dimension with ±1 SD error bars. The vertical dashed line marks the recommended d.

s <- dim_est$summary

plot(s$d, s$mean_loss,
     type = "b", pch = 19,
     xlab = "Dimensions (d)",
     ylab = "Mean test loss",
     main = "Hold-out loss by dimensionality")

arrows(s$d, s$mean_loss - s$sd_loss,
       s$d, s$mean_loss + s$sd_loss,
       angle = 90, code = 3, length = 0.05, col = "gray50")

abline(v = s$d[s$best_d], lty = 2, col = "steelblue")

Look for an elbow: a point where increasing d no longer reduces loss substantially. The one-SE rule formalises this by returning the smallest d in the flat region of the curve.

Practical notes

The function can be slow for large datasets or many restarts because it trains length(dims) × n_restarts independent models. A few suggestions:

Start with a coarse grid (e.g. dims = c(1, 2, 3, 5, 8)) and n_restarts = 5 to get a rough picture, then refine around the elbow.
Use max_epochs and tol_window values consistent with the values you plan to use for the final embedding.
If a GPU is available, pass device = "cuda" (or "mps" on Apple Silicon) to speed up each individual fit.
estimate_dimensionality() trains only a group embedding (pooling all participants) at each d. This is usually the right choice for selecting d, and it is faster than fitting per-participant models.
Each (d, restart) pair is independent, so the function can be parallelised — see the next section.

Running in parallel

Because every (d, restart) fit is completely independent, estimate_dimensionality() can distribute work across multiple cores or a compute cluster with no changes to your analysis code. Parallelism is controlled via the future framework: you set a plan once before calling the function, and the function uses however many workers that plan provides.

Install the required packages once:

install.packages(c("future", "future.apply"))

Local multicore

Use all available cores on your workstation (or a subset):

library(future)
plan(multisession, workers = 4)   # or: workers = parallel::detectCores() - 1

dim_est <- estimate_dimensionality(
  triplet_list = icon_triplets,
  dims         = 1:8,
  n_restarts   = 10L,
  max_epochs   = 50000L,
  seed         = 42L
)

plan(sequential)   # restore serial execution when done

With 4 workers and 80 total fits (8 dims × 10 restarts), wall time drops to roughly one quarter of the serial time.

multisession launches separate R processes, so it works on Windows, macOS, and Linux. If you are on Linux or macOS and prefer lower overhead, plan(multicore) uses forked processes instead.

HTCondor cluster

If you have access to an HTCondor cluster (e.g. UW–Madison’s CHTC), install the future.batchtools package, which bridges future to HTCondor’s job scheduler:

install.packages("future.batchtools")

You also need a one-time cluster template file that tells future.batchtools how to construct HTCondor submit files for your site. Save the following as ~/.batchtools.condor.tmpl (adjust request_memory and any site-specific lines for your cluster):

universe     = vanilla
executable   = '/opt/R/4.6.0/lib/R/bin/Rscript'
arguments    = -e 'batchtools::doJobCollection("$(job.collection)")'

transfer_input_files  = $(job.collection)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

request_cpus   = 1
request_memory = 4GB
request_disk   = 2GB

log    = $(log)
output = $(output)
error  = $(error)

queue

Then submit your dimensionality search exactly as before, just with a different plan:

library(future.batchtools)

plan(batchtools_condor, workers = 80)   # up to 80 simultaneous jobs

dim_est <- estimate_dimensionality(
  triplet_list = icon_triplets,
  dims         = 1:8,
  n_restarts   = 10L,
  max_epochs   = 50000L,
  seed         = 42L
)

plan(sequential)

future.batchtools handles job submission, monitors completion, retrieves results, and assembles them — you get back the same dim_est list as in the serial case.

Prerequisite: every worker node must have R, tripletTools, and the triplet-embeddings conda environment already installed. This is a one-time setup task; ask your cluster administrator or follow your site’s software installation guide.

Checking the active plan

At any point you can confirm which plan is in effect:

library(future)
plan()   # prints a description of the current plan

plan(sequential) (the default) runs everything serially and requires neither future nor future.apply — all other tripletTools functions are unaffected.

Overview

Step 1: One-time Python environment setup

Step 2: Computing a group embedding

What is returned

Inspecting the result

Visualizing the embedding

Step 3: Computing individual embeddings

What is returned

Inspecting training quality

Working with individual embeddings

Visualizing individual embeddings

Choosing between the two functions

Controlling training

Choosing the number of dimensions

Running the grid

How best_d is determined

Visualizing the results

Practical notes

Running in parallel

Local multicore

HTCondor cluster

Checking the active plan

How `best_d` is determined