Computing Triplet Embeddings
embedding_vignette.RmdOverview
This vignette explains how to compute similarity embeddings from triplet judgment data collected with this package. Embeddings are computed by a Python backend that uses PyTorch to optimize the crowd kernel (CKL) noise model. From R you interact with this backend through three functions:
| Function | What it does |
|---|---|
setup_python_env() |
One-time installation of the Python environment |
estimate_dimensionality() |
Fits embeddings across a range of dimensions to help choose
d
|
run_group_embedding_from_list() |
Trains a single embedding on all participants’ data combined |
run_embeddings_from_list() |
Trains one embedding per participant plus a group embedding |
All functions accept data in the named-list format used throughout
tripletTools (e.g. icon_triplets).
Step 1: One-time Python environment setup
The embedding pipeline runs inside a self-contained conda
environment. You only need to do this once after installing the package.
It installs PyTorch, NumPy, pandas, scikit-learn, scipy, and skorch into
an environment called "triplet-embeddings".
PyTorch is a large download (300–800 MB depending on your platform),
so the first run may take several minutes. On future R sessions you do
not need to call setup_python_env() again
— the environment is detected and activated automatically when you load
the package.
If conda is not found on your system, setup_python_env()
will stop with a message directing you to the Miniconda installation
page.
Step 2: Computing a group embedding
run_group_embedding_from_list() pools the judgments from
all participants and trains a single embedding that captures the shared
similarity structure across the group. This is the fastest option and is
a natural starting point before computing individual embeddings.
grp <- run_group_embedding_from_list(
triplet_list = icon_triplets,
d = 3L, # number of dimensions
max_epochs = 50000L
)The function prints training progress every 100 epochs, showing train loss, test loss, train accuracy, and test accuracy. Training stops early if the test loss has not improved meaningfully for 10,000 consecutive epochs.
What is returned
run_group_embedding_from_list() returns a named
list:
str(grp)
#> List of 3
#> $ embedding: num [1:32, 1:3] ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:32] "bdbwb" "bdbwb2" ... # item names as row names
#> .. ..$ : chr [1:3] "dim_0" "dim_1" "dim_2"
#> $ loss : num 0.412
#> $ history :'data.frame': 847 obs. of 5 variables:
#> ..$ epoch : int [1:847] 0 1 2 ...
#> ..$ train_loss: num [1:847] ...
#> ..$ test_loss : num [1:847] ...
#> ..$ train_acc : num [1:847] ...
#> ..$ test_acc : num [1:847] ...-
embedding— a matrix with one row per stimulus (item names as row names) anddcolumns (dim_0,dim_1, …). -
loss— the best test loss achieved during training. -
history— a data frame recording train/test loss and accuracy at each epoch, useful for diagnosing convergence.
Inspecting the result
# First few rows of the embedding
head(grp$embedding)
# Best test loss
cat("Best test loss:", round(grp$loss, 3), "\n")
# Did training converge? Plot the test loss curve.
plot(grp$history$epoch, grp$history$test_loss,
type = "l", col = "steelblue",
xlab = "Epoch", ylab = "Test loss",
main = "Group embedding – training curve")Visualizing the embedding
Once you have the group embedding, you can use
plot_pics() to position stimulus images at their
coordinates in the learned similarity space (first two dimensions
shown):
plot_pics(grp$embedding[, 1:2], icon_pics,
psize = 0.04,
xlab = "Dimension 1",
ylab = "Dimension 2",
main = "Group embedding")Items that appear close together were frequently judged as similar by participants overall.
The group embedding can also be used with the full set of
tripletTools analysis functions. For example, to evaluate
how well it predicts held-out judgments for participant 1:
acc <- get.hoacc(grp$embedding, icon_triplets[[1]])
cat("Hold-out accuracy (participant 1):", round(acc, 3), "\n")Step 3: Computing individual embeddings
run_embeddings_from_list() trains a separate embedding
for each participant and then trains an additional group embedding on
all participants’ data combined. This takes longer but lets you study
individual differences in how participants represent the stimulus
set.
results <- run_embeddings_from_list(
triplet_list = icon_triplets,
output_dir = "embeddings_output", # CSVs are also saved here
d = 3L,
max_epochs = 50000L
)Output CSV files are written to output_dir as a
side-effect:
| File | Contents |
|---|---|
embeddings.csv |
All per-participant and group embeddings concatenated |
embeddings_group.csv |
Group embedding only |
model_history.csv |
Training diagnostics for each participant and the group |
What is returned
run_embeddings_from_list() returns a named list:
str(results, max.level = 2)
#> List of 3
#> $ individual:List of 6
#> ..$ 3n7ggxph: num [1:32, 1:3] ... # one matrix per participant
#> ..$ b5wma4no: num [1:32, 1:3] ...
#> ..$ d8mmm1qn: num [1:32, 1:3] ...
#> ..$ jn7bbjc0: num [1:32, 1:3] ...
#> ..$ pbby694o: num [1:32, 1:3] ...
#> ..$ sc2xbd6w: num [1:32, 1:3] ...
#> $ group : num [1:32, 1:3] ... # group embedding matrix
#> $ history :'data.frame': 7 obs. of 6 variables:
#> ..$ worker_id : chr [1:7] "3n7ggxph" ... "group"
#> ..$ lowest_loss : num [1:7] ...
#> ..$ epoch : int [1:7] ...
#> ..$ counter_from_last_update: int [1:7] ...
#> ..$ n_train_triplets : int [1:7] ...
#> ..$ n_test_triplets : int [1:7] ...-
individual— a named list of matrices, one per participant, in the same format asgrp$embeddingabove. Element names match theworker_idlabels in your data. -
group— a single matrix containing the group embedding. -
history— a data frame with one row per participant plus one row for the group, summarising training diagnostics.
Inspecting training quality
The history data frame gives a quick overview of how
well each participant’s model converged:
results$history[, c("worker_id", "lowest_loss", "epoch", "n_train_triplets")]Participants who stopped early (small epoch) and
achieved a low lowest_loss are well-modelled. If a
participant’s lowest_loss is much higher than the others,
their data may be noisier or they may have fewer trials.
Working with individual embeddings
The individual embeddings are stored in
results$individual and have the same structure as the group
embedding: a matrix with item row names and columns dim_0,
dim_1, …. This means they are compatible with all existing
tripletTools functions.
# Hold-out prediction accuracy for each participant using their own embedding
sapply(names(results$individual), function(wid) {
get.hoacc(results$individual[[wid]], icon_triplets[[wid]])
})
# Prediction matrix: how well does each participant's embedding predict
# every other participant's held-out judgments?
pmat <- get.prediction.matrix(results$individual, icon_triplets)
# Do participants predict their own data better than others predict it?
cat("Mean self-prediction accuracy: ", round(mean(diag(pmat)), 3), "\n")
cat("Mean other-prediction accuracy: ", round(mean(pmat[row(pmat) != col(pmat)]), 3), "\n")Choosing between the two functions
run_group_embedding_from_list() |
run_embeddings_from_list() |
|
|---|---|---|
| Speed | Fast — one model | Slow — one model per participant + group |
| Output | Single group embedding | Per-participant + group embeddings |
| Best for | Quick summary of shared structure; pilot analyses | Studying individual differences; full analysis |
| Saved files | None | Three CSVs in output_dir
|
If you only need the group embedding, prefer
run_group_embedding_from_list(). If you want to study
whether participants differ in their representations — using tools like
get.prediction.matrix(), get.rep.dist(), or
pacc.by.cluster() — run
run_embeddings_from_list() to obtain individual embeddings
as well.
Controlling training
Both functions accept the same set of hyperparameters:
| Parameter | Default | Meaning |
|---|---|---|
d |
5 |
Number of embedding dimensions |
max_epochs |
50000 |
Maximum training epochs |
tolerance |
1e-4 |
Minimum loss improvement to reset the early-stopping counter |
tol_window |
10000 |
Epochs without improvement before stopping |
seed |
222 |
Random seed for reproducibility |
For exploratory analyses, the defaults are a reasonable starting
point. For a final analysis you may want to increase
max_epochs and tol_window to allow longer
convergence, especially with large datasets or high d.
To inspect the training curve after fitting:
plot(grp$history$epoch, grp$history$test_loss,
type = "l", col = "steelblue", lwd = 2,
xlab = "Epoch", ylab = "Test loss",
main = "Convergence check")
lines(grp$history$epoch, grp$history$train_loss,
col = "tomato", lwd = 2, lty = 2)
legend("topright",
legend = c("Test", "Train"),
col = c("steelblue", "tomato"),
lty = c(1, 2), lwd = 2)A well-converged model shows test loss flattening well before
max_epochs. If the loss is still falling at the end,
increase max_epochs.
Choosing the number of dimensions
The number of embedding dimensions d is a hyperparameter
that must be set before fitting. Too few dimensions underfit the
similarity structure; too many overfit noise. The right approach is to
treat d as an unknown, fit embeddings at several values,
and choose the smallest d whose hold-out loss is not
meaningfully higher than the best.
Because the optimizer is stochastic, hold-out loss varies across runs
at the same d. A single fit per dimensionality is not
reliable — a good run at d = 3 might beat a bad run at
d = 4 by chance. estimate_dimensionality()
addresses this by fitting multiple independent random restarts at each
d and recording the best hold-out loss from each
restart.
Running the grid
dim_est <- estimate_dimensionality(
triplet_list = icon_triplets,
dims = 1:8, # dimensionalities to evaluate
n_restarts = 10L, # independent random restarts per d
max_epochs = 50000L,
seed = 42L
)Progress messages are printed as each restart completes. The function returns a named list with two elements.
$results has one row per (dimension, restart):
head(dim_est$results)
#> d restart loss epoch
#> 1 1 1 0.6022... ...
#> 2 1 2 0.6019... ...
#> ...$summary has one row per dimension, with the
best_d flag identifying the recommended choice:
dim_est$summary
#> d mean_loss min_loss sd_loss best_d
#> 1 1 0.601 0.601 0.001 FALSE
#> 2 2 0.512 0.510 0.003 FALSE
#> 3 3 0.448 0.445 0.004 TRUE # <-- recommended
#> 4 4 0.447 0.443 0.005 FALSE
#> ...How best_d is determined
The recommendation uses the one-standard-error rule:
find the dimensionality with the lowest mean hold-out loss, compute one
standard error of that mean
(i.e. sd_loss / sqrt(n_restarts)), and then flag the
smallest d whose mean loss falls within that
margin. This favours parsimony — it selects the simplest model that is
statistically indistinguishable from the best.
Visualizing the results
A simple plot shows mean loss by dimension with ±1 SD error bars. The
vertical dashed line marks the recommended d.
s <- dim_est$summary
plot(s$d, s$mean_loss,
type = "b", pch = 19,
xlab = "Dimensions (d)",
ylab = "Mean test loss",
main = "Hold-out loss by dimensionality")
arrows(s$d, s$mean_loss - s$sd_loss,
s$d, s$mean_loss + s$sd_loss,
angle = 90, code = 3, length = 0.05, col = "gray50")
abline(v = s$d[s$best_d], lty = 2, col = "steelblue")Look for an elbow: a point where increasing
d no longer reduces loss substantially. The one-SE rule
formalises this by returning the smallest d in the flat
region of the curve.
Practical notes
The function can be slow for large datasets or many restarts because
it trains length(dims) × n_restarts independent models. A
few suggestions:
- Start with a coarse grid (e.g.
dims = c(1, 2, 3, 5, 8)) andn_restarts = 5to get a rough picture, then refine around the elbow. - Use
max_epochsandtol_windowvalues consistent with the values you plan to use for the final embedding. - If a GPU is available, pass
device = "cuda"(or"mps"on Apple Silicon) to speed up each individual fit. -
estimate_dimensionality()trains only a group embedding (pooling all participants) at eachd. This is usually the right choice for selectingd, and it is faster than fitting per-participant models. - Each (d, restart) pair is independent, so the function can be parallelised — see the next section.
Running in parallel
Because every (d, restart) fit is completely independent,
estimate_dimensionality() can distribute work across
multiple cores or a compute cluster with no changes to your analysis
code. Parallelism is controlled via the future
framework: you set a plan once before calling the function, and
the function uses however many workers that plan provides.
Install the required packages once:
install.packages(c("future", "future.apply"))Local multicore
Use all available cores on your workstation (or a subset):
library(future)
plan(multisession, workers = 4) # or: workers = parallel::detectCores() - 1
dim_est <- estimate_dimensionality(
triplet_list = icon_triplets,
dims = 1:8,
n_restarts = 10L,
max_epochs = 50000L,
seed = 42L
)
plan(sequential) # restore serial execution when doneWith 4 workers and 80 total fits (8 dims × 10 restarts), wall time drops to roughly one quarter of the serial time.
multisession launches separate R processes, so it works
on Windows, macOS, and Linux. If you are on Linux or macOS and prefer
lower overhead, plan(multicore) uses forked processes
instead.
HTCondor cluster
If you have access to an HTCondor cluster (e.g. UW–Madison’s CHTC),
install the future.batchtools package, which bridges
future to HTCondor’s job scheduler:
install.packages("future.batchtools")You also need a one-time cluster template file that tells
future.batchtools how to construct HTCondor submit files
for your site. Save the following as
~/.batchtools.condor.tmpl (adjust
request_memory and any site-specific lines for your
cluster):
universe = vanilla
executable = '/opt/R/4.6.0/lib/R/bin/Rscript'
arguments = -e 'batchtools::doJobCollection("$(job.collection)")'
transfer_input_files = $(job.collection)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus = 1
request_memory = 4GB
request_disk = 2GB
log = $(log)
output = $(output)
error = $(error)
queue
Then submit your dimensionality search exactly as before, just with a different plan:
library(future.batchtools)
plan(batchtools_condor, workers = 80) # up to 80 simultaneous jobs
dim_est <- estimate_dimensionality(
triplet_list = icon_triplets,
dims = 1:8,
n_restarts = 10L,
max_epochs = 50000L,
seed = 42L
)
plan(sequential)future.batchtools handles job submission, monitors
completion, retrieves results, and assembles them — you get back the
same dim_est list as in the serial case.
Prerequisite: every worker node must have R,
tripletTools, and the triplet-embeddings conda
environment already installed. This is a one-time setup task; ask your
cluster administrator or follow your site’s software installation
guide.