Cross Validation

CPPLS is a supervised method, so it is always at risk of learning structure that is only accidentally aligned with the response. Cross-validation is used to test whether the relationship learned during fitting also generalizes to samples that were not used to fit the model. In CPPLS, cross-validation also serves a second purpose: selecting the number of latent variables. These two tasks are coupled, because model complexity has a direct effect on apparent predictive performance.

The package implements explicit nested cross-validation. The outer loop is used for performance assessment, whereas the inner loop is used for model selection. This keeps the choice of the number of latent variables separated from the final evaluation of the model and reduces optimistic bias.

How Nested Cross-Validation Is Implemented in CPPLS

nestedcv uses disjoint outer folds for performance assessment and disjoint inner folds for model selection. For each outer repeat, one fold is held out as a test set and the remaining samples are used for training. Within that outer training set, an inner cross-validation determines the number of latent variables. A final model is then fitted on the full outer training set with the selected number of components and applied to the outer test set.

For each inner fold, CPPLS evaluates every component count in 1:max_components and selects the best one with select_fn. The final component count for the current outer repeat is the median of those inner-fold selections, rounded down to an integer. With the default selectors, ties are resolved in favor of the smaller component count.
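The selection rule above can be sketched in a few lines (the values in inner_selections are hypothetical, not output of a package function):

```julia
using Statistics

# Hypothetical best component counts selected in four inner folds.
inner_selections = [2, 3, 3, 4]

# Final component count for the outer repeat: the median of the
# inner-fold selections, rounded down to an integer.
k_final = floor(Int, median(inner_selections))
```

With an even number of inner folds the median can be fractional, e.g. median([2, 3]) == 2.5, so the floor is what makes the smaller component count win in ties.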

If strata are supplied, fold construction is stratified so that class proportions are approximately preserved across folds. Within a given fold construction, the folds are non-overlapping. The only thing that varies is whether the outer folds are reused or rebuilt between repeats: with reshuffle_outer_folds=false, the same outer partition is reused, whereas with reshuffle_outer_folds=true, a new outer partition is drawn for each repeat.
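A minimal sketch of stratified, non-overlapping fold assignment, assuming a round-robin scheme (the helper stratified_folds is illustrative, not the package's internal implementation):

```julia
using Random

# Assign each sample to one of k disjoint folds, dealing shuffled samples
# round-robin within each stratum so class proportions stay balanced.
function stratified_folds(strata::AbstractVector, k::Integer, rng::AbstractRNG)
    fold = zeros(Int, length(strata))
    for s in unique(strata)
        idx = shuffle(rng, findall(==(s), strata))
        for (j, i) in enumerate(idx)
            fold[i] = mod1(j, k)  # fold indices cycle through 1..k
        end
    end
    return fold
end

strata = [1, 1, 1, 1, 2, 2, 2, 2]
folds = stratified_folds(strata, 2, MersenneTwister(1))
# Each of the two folds receives two samples from each stratum.
```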

In summary, the outer scores measure prediction on samples excluded from fitting, while the inner loop controls model complexity. For classification, CPPLS.cv_classification provides an accuracy-like score based on one-hot predictions and normalized misclassification cost. For regression, CPPLS.cv_regression provides the corresponding callbacks, using root mean squared error by default.
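For intuition, the scoring callbacks have the shape score_fn(Y_true, Y_pred) -> Real. A plain RMSE in that shape could look like the following; this is an illustration only, the actual callback bundle comes from CPPLS.cv_regression:

```julia
using Statistics

# Illustrative RMSE with the score_fn(Y_true, Y_pred) -> Real shape.
rmse(Y_true, Y_pred) = sqrt(mean(abs2, Y_true .- Y_pred))

rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

Because RMSE is a smaller-is-better loss, a matching select_fn would minimize over the inner-fold scores, whereas an accuracy-like score would be maximized.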

Permutation-Based Significance Assessment

Even a carefully cross-validated supervised model can appear predictive if the response is weakly structured, the sample size is small, or the predictor matrix is high-dimensional. To address this concern, CPPLS provides nestedcvperm, which places the entire nested cross-validation workflow inside a permutation test.

The idea is to destroy the real link between predictors and response while keeping the predictor matrix itself unchanged. CPPLS does that by permuting the rows of Y. After permutation:

  1. The same nested CV procedure is rerun.
  2. The mean outer-fold score from that permuted run is stored.
  3. This is repeated many times to build a null distribution of scores expected when the correspondence between X and Y is random.

This is useful because it tests the whole modeling pipeline, not just a single fitted model. The null distribution therefore includes the effect of repeated fold splitting, inner-loop model selection, and final outer-loop evaluation. If the score from the real, unpermuted data lies well outside what is typical under permutation, the result is more consistent with genuine predictive structure than with chance alignment. The helper pvalue can then be used to summarize the observed score relative to the permutation distribution. This comparison is only valid if the score from the real data is aggregated in exactly the same way, that is, as the mean of the outer-fold scores.
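The aggregation requirement can be sketched with hypothetical numbers; the +1 correction matches the empirical p-value computed by pvalue:

```julia
using Statistics

outer_scores = [0.90, 0.85, 0.95]  # hypothetical outer-fold scores
observed = mean(outer_scores)      # aggregate exactly like the permuted runs

null_scores = [0.48, 0.52, 0.55, 0.61, 0.50]  # hypothetical permuted means

# One-sided upper-tail empirical p-value with the +1 correction.
p = (count(>=(observed), null_scores) + 1) / (length(null_scores) + 1)
```

Had observed been the maximum rather than the mean of the outer-fold scores, it would be ranked against a null distribution built from a different statistic, and the p-value would be meaningless.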

When strata are supplied, the same permutation applied to Y is also applied to the stratification vector before nested CV is rerun. This keeps the stratified fold generation consistent with the permuted labels during each null-model evaluation.
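A sketch of one null-model draw with stand-in data: the same row permutation is applied to Y and to the stratification vector, while X is left untouched.

```julia
using Random

rng = MersenneTwister(42)
n = 6
perm = randperm(rng, n)

Y = collect(1:n)               # stand-in for the rows of the response
strata = [1, 1, 1, 2, 2, 2]    # stand-in stratification vector

Y_perm = Y[perm]               # permuted response rows
strata_perm = strata[perm]     # same permutation keeps folds consistent
```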

DA-Specific Interpretation and Outlier Scanning

For discriminant analysis, ordinary cross-validation implicitly treats the provided class labels as ground truth. That means the assessment assumes that every sample assigned to a class is in fact a correct representative of that class. In controlled benchmark datasets that assumption may be reasonable, but in experimental settings it is not guaranteed. Samples can be mislabeled, contaminated, technically compromised, or simply atypical in a way that makes them behave poorly under the fitted model.

For that reason, CPPLS also provides outlierscan. This routine is not a replacement for nested CV; instead, it is a diagnostic extension for discriminant analysis. It repeatedly:

  1. builds an outer training/test split,
  2. selects the number of latent variables within the training data,
  3. predicts the held-out samples, and
  4. records which test samples were misclassified.

Across repeated outer folds, each sample accumulates two counts:

  1. n_tested: how many times that sample appeared in an outer test set,
  2. n_flagged: how many of those appearances led to a wrong class assignment.

The ratio n_flagged ./ max.(1, n_tested) is returned as rate; the guard gives a rate of 0.0 to samples that were never tested. Samples with elevated rates are not automatically mislabeled, but they are candidates for closer inspection. In practice, such samples may indicate possible class-assignment problems, outlying biology, unusual measurement behavior, or incompatibility between the sample and the fitted latent structure.
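With illustrative counts, the rate computation, including the guard documented for outlierscan, looks like this:

```julia
n_tested  = [10, 8, 0, 12]  # illustrative per-sample test counts
n_flagged = [1, 4, 0, 3]    # illustrative per-sample misclassification counts

# The max guard avoids 0/0 for samples that were never tested,
# giving them a rate of 0.0.
rate = n_flagged ./ max.(1, n_tested)
```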

This makes outlierscan especially useful in experimental DA applications, where the goal is often not only to estimate classification performance but also to identify samples that deserve follow-up quality control.

Example

We again use the synthetic discriminant-analysis dataset introduced on the Fit page. The goal here is to estimate predictive performance with nested cross-validation, compare that performance against a permutation-based null distribution, and then inspect which samples are most often misclassified across repeated outer folds.

To keep the documentation example reasonably fast, we use a fixed gamma=0.5 and allow at most two latent variables. For a real analysis, those settings should be chosen more carefully.

In this example, we use the higher-level discriminant-analysis wrappers cvda and permda. These functions apply the standard DA defaults automatically: they use the callback bundle from CPPLS.cv_classification, recompute inverse-frequency observation weights inside each training fold, derive the stratification from the class labels, and inject responselabels into fit_kwargs when needed. If you instead want fixed sample-specific weights, pass them through fit_kwargs=(; obs_weights=...) and they will simply be subsetted to each training split.

The packages loaded below play different roles: CPPLS provides the modeling and cross-validation functions, JLD2 reads the example dataset from disk, Random provides a reproducible RNG, Statistics provides summary functions such as mean, and CairoMakie is used to draw and save the histogram. In a normal Julia environment, packages such as CPPLS, JLD2, and CairoMakie must be installed before running the example; the Julia Pkg documentation explains how to install registered packages in the Getting Started section.

using CPPLS
using JLD2
using Random
using Statistics
using CairoMakie

samplelabels, X, classes, Y_aux = load(
    CPPLS.dataset("synthetic_cppls_da_dataset.jld2"),
    "sample_labels",
    "X",
    "classes",
    "Y_aux"
)

m = CPPLSModel(
    ncomponents=2,
    gamma=0.5,
    mode=:discriminant,
    center_X=true,
    scale_X=true
)

fit_kwargs = (
    Yaux=Y_aux,
    samplelabels=samplelabels
)

scores, best_components = cvda(
    X,
    classes;
    spec=m,
    fit_kwargs=fit_kwargs,
    num_outer_folds=5,
    num_outer_folds_repeats=5,
    num_inner_folds=4,
    num_inner_folds_repeats=4,
    max_components=2,
    rng=MersenneTwister(12345),
    verbose=false
)

observed_accuracy = mean(scores)
0.9111111111111111

The returned scores vector contains the outer-fold accuracies, and the best_components vector contains the component counts selected inside the outer repeats. The value observed_accuracy is the mean nested-CV score on the real labels.

An observed mean accuracy of about 0.91 looks convincing, but with a small dataset and a flexible pipeline the question remains whether it is significantly better than chance. To answer that, we compare it with a null distribution obtained from permuted data in which the correspondence between predictors and class labels has been broken. For a fair comparison, the permutation procedure uses exactly the same settings as cvda, here with a total of 100 permutations.

permutation_scores = permda(
    X,
    classes;
    spec=m,
    fit_kwargs=fit_kwargs,
    num_permutations=100,
    num_outer_folds=5,
    num_outer_folds_repeats=5,
    num_inner_folds=4,
    num_inner_folds_repeats=4,
    max_components=2,
    rng=MersenneTwister(54321),
    verbose=false,
)
100-element Vector{Float64}:
 0.5388888888888888
 0.27222222222222214
 0.5499999999999999
 0.41666666666666663
 0.4277777777777777
 0.6888888888888889
 0.44444444444444436
 0.5166666666666666
 0.41666666666666663
 0.5388888888888888
 ⋮
 0.511111111111111
 0.5833333333333334
 0.49444444444444435
 0.44444444444444436
 0.5333333333333332
 0.6277777777777778
 0.7333333333333333
 0.5499999999999999
 0.44444444444444436

The permutation_scores vector contains the mean outer-fold accuracies for each of the 100 permutations. Let us visualize that distribution.

f = Figure(; size=(900, 600))
ax = Axis(
    f[1, 1],
    title="Model accuracy null distribution",
    xlabel="Mean outer-fold accuracy",
    ylabel="Count across permutations"
)
hist!(ax, permutation_scores, bins=20)
save("accuracy_hist.svg", f)
nothing

As we can see, the permutation accuracies are mostly distributed around and below 0.55, and none of them reaches the observed accuracy of about 0.91 from the real data. We can quantify that more formally with pvalue, using the permutation scores and the observed accuracy. Because classification uses an accuracy-like score here, pvalue is called with tail=:upper, corresponding to a one-sided upper-tail test in which the p-value is the probability, under the null model, of observing a score at least as large as the observed one.

p_value = pvalue(permutation_scores, observed_accuracy; tail=:upper)
0.009900990099009901

In this example, the resulting p-value indicates statistical significance at an alpha = 0.05 threshold.

Note that the assessment above was comparatively inexpensive because X is small and we used a fixed gamma value. With datasets containing hundreds of samples and thousands of traits, especially when gamma is optimized, the computation becomes much more demanding. In that situation it can make sense to distribute the permutation runs. For example, one could distribute 50 separate calls to CPPLS.permda across 20 nodes, then concatenate the stored permutation_scores vectors before passing them to pvalue.

Warning

When permutation runs are distributed across multiple jobs or nodes, each run should be started with a different RNG seed. Reusing the same seed can lead to overlapping permutation sequences and therefore to a biased null distribution.
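A sketch of combining separately seeded runs; the per-job score vectors below are random stand-ins, whereas in practice each job would call CPPLS.permda with its own seed and save its permutation_scores:

```julia
using Random

seeds = 1000 .+ (1:4)  # one distinct seed per job, never reused

# Stand-ins for the permutation_scores vectors produced by each job.
chunks = [rand(MersenneTwister(s), 25) for s in seeds]

# Concatenate all per-job vectors before computing the p-value.
null_scores = reduce(vcat, chunks)
```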

When the focus shifts from global performance to potentially problematic samples, outlierscan can be used as a follow-up diagnostic. By default it uses the same fold-local inverse-frequency weighting rule as cvda and permda, although you can still override that behavior with a custom obs_weight_fn or disable it by passing obs_weight_fn=nothing.

outlier_scan = outlierscan(
    X,
    classes;
    spec=m,
    fit_kwargs=(; Yaux=Y_aux, samplelabels=samplelabels),
    num_outer_folds=5,
    num_outer_folds_repeats=100,
    num_inner_folds=4,
    num_inner_folds_repeats=4,
    max_components=2,
    rng=MersenneTwister(54321),
    verbose=false,
)
(n_tested = [15, 20, 16, 21, 21, 21, 13, 20, 24, 16  …  19, 18, 20, 18, 25, 19, 21, 15, 28, 17], n_flagged = [0, 0, 0, 0, 0, 0, 0, 0, 0, 6  …  8, 0, 0, 0, 0, 0, 0, 1, 0, 0], rate = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.375  …  0.42105263157894735, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.06666666666666667, 0.0, 0.0])

outlier_scan is a named tuple with the fields n_tested, n_flagged, and rate. n_tested specifies how often a given sample was evaluated, n_flagged specifies how often that sample was misclassified, and rate is the ratio of the two. Most users will primarily be interested in rate. Let us inspect the samples with the largest rates, rounded to three decimal places and sorted in descending order.

suspect_idx = sortperm(outlier_scan.rate, rev=true)[1:15]
rate = round.(outlier_scan.rate[suspect_idx]; digits=3)
for (j, i) in enumerate(suspect_idx)
    println("Sample: ", samplelabels[i], "; error rate: ", rate[j])
end
Sample: 74; error rate: 0.941
Sample: 41; error rate: 0.889
Sample: 82; error rate: 0.85
Sample: 84; error rate: 0.81
Sample: 18; error rate: 0.75
Sample: 85; error rate: 0.619
Sample: 52; error rate: 0.607
Sample: 81; error rate: 0.571
Sample: 73; error rate: 0.55
Sample: 86; error rate: 0.476
Sample: 15; error rate: 0.429
Sample: 91; error rate: 0.421
Sample: 36; error rate: 0.412
Sample: 13; error rate: 0.4
Sample: 10; error rate: 0.375

Often it is useful to see those rates in the fitted score space rather than only as a sorted text list. Following the plotting pattern used on the Fit page, we fit a two-component CPPLS-DA model to the full dataset with the same inverse-frequency observation weighting used by the DA cross-validation wrappers, and then overlay the samples whose outlier-scan error rate is at least 0.5. In this synthetic example, those are the samples that most consistently fail across repeated outer-fold predictions.

major_outlier_threshold = 0.5
major_outlier_idx = findall(>=(major_outlier_threshold), outlier_scan.rate)
class_weights = invfreqweights(classes)

outlier_view_model = fit(
    m,
    X,
    classes;
    obs_weights=class_weights,
    Yaux=Y_aux,
    samplelabels=samplelabels
)

outlier_view_scores = xscores(outlier_view_model, 1:2)

outlier_fig = Figure(; size=(900, 600))
outlier_ax = Axis(
    outlier_fig[1, 1],
    title="CPPLS-DA score plot with repeated outlier-scan failures highlighted",
    xlabel="Latent Variable 1",
    ylabel="Latent Variable 2"
)

scoreplot(
    samplelabels,
    classes,
    outlier_view_scores;
    backend=:makie,
    figure=outlier_fig,
    axis=outlier_ax,
    show_legend=false,
    default_marker=(; markersize=14)
)

major_outlier_scores = outlier_view_scores[major_outlier_idx, :]
x_span = maximum(outlier_view_scores[:, 1]) - minimum(outlier_view_scores[:, 1])
y_span = maximum(outlier_view_scores[:, 2]) - minimum(outlier_view_scores[:, 2])
label_dx = 0.02 * x_span
label_dy = 0.02 * y_span

scatter!(
    outlier_ax,
    major_outlier_scores[:, 1],
    major_outlier_scores[:, 2];
    color=(:crimson, 0.18),
    strokecolor=:crimson,
    strokewidth=2,
    marker=:circle,
    markersize=24,
    label="error rate ≥ 0.5"
)

text!(
    outlier_ax,
    major_outlier_scores[:, 1] .+ label_dx,
    major_outlier_scores[:, 2] .+ label_dy;
    text=samplelabels[major_outlier_idx],
    color=:crimson,
    fontsize=14,
    align=(:left, :bottom)
)

axislegend(outlier_ax; position=:rb)
save("outlier_scoreplot.svg", outlier_fig)
nothing

This view makes the diagnostic easier to interpret because it shows where the repeatedly flagged samples fall in the fitted score space. In this synthetic dataset, all highlighted samples belong to the minority class. That pattern is likely driven by class imbalance and the resulting instability of fold-wise classification, rather than by true mislabeling. In the current example, none of these samples is actually mislabeled, so the plot should be read as a visualization of which samples are hardest to classify consistently. The outlier-scan rates are therefore best understood as a practical way to prioritize follow-up inspection, not as proof that the highlighted samples are genuine outliers or incorrectly labeled.

The functions discussed above are documented in full below.

API

CPPLS.pvalueFunction
pvalue(
    null_scores::AbstractVector{<:Real},
    observed_score::Real;
    tail::Symbol=:upper
)

Compute a one-sided empirical p-value for an observed_score relative to a null distribution of scores. The input null_scores is the vector of scores from null-model or reference runs, typically obtained from label-shuffled permutations, and observed_score is the score achieved by the model fit to the original data. The argument tail selects the direction of the one-sided test and must be either :upper or :lower. In both cases, a +1 correction is applied to the numerator and denominator to account for the observed score itself in the empirical null ranking (Phipson & Smyth 2010). With tail=:upper, the p-value is the fraction of null scores greater than or numerically equal to the observed score, corresponding to a one-sided upper-tail test appropriate when larger scores indicate stronger evidence against the null. With tail=:lower, the comparison is reversed, corresponding to a one-sided lower-tail test appropriate when smaller scores indicate stronger evidence against the null.

Reference Phipson B, Smyth GK (2010): Permutation P-values should never be zero: Calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology 9: 39. https://doi.org/10.2202/1544-6115.1585

See also nestedcv, nestedcvperm

Example

julia> pvalue([0.4, 0.5, 0.55, 0.6], 0.58) ≈ 0.4
true

julia> pvalue([0.4, 0.5, 0.55, 0.6], 0.58; tail=:lower) ≈ 0.8
true
source
CPPLS.cvdaFunction
cvda(
    X::AbstractMatrix{<:Real},
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    weighted::Bool=true,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true,
)

Run nested cross-validation for CPPLS discriminant analysis with the standard DA defaults wired in automatically. This wrapper is a convenience layer over nestedcv that fixes the classification callbacks, recomputes inverse-frequency observation weights inside each training split, and derives stratified folds from Y.

Internally, cvda uses the same callback bundle returned by CPPLS.cv_classification() and the same fold-local weighting rule commonly used in CPPLS-DA workflows: obs_weight_fn(X_train, Y_train; kwargs...) = invfreqweights(sampleclasses(Y_train)). The stratification vector is always sampleclasses(Y).

Use nestedcv directly when you need custom callbacks, custom fold weighting, custom stratification, or any other lower-level control that is not exposed by this wrapper.

Arguments

  • X: predictor matrix with one row per sample.
  • Y: one-hot response matrix with one row per sample and one column per class.

Keyword arguments

  • spec: CPPLS specification. spec.mode must be :discriminant.
  • fit_kwargs: additional keyword arguments forwarded to fit. If responselabels are absent, default labels are injected automatically.
  • weighted: passed to CPPLS.cv_classification(; weighted=weighted) to control whether the outer and inner scores use inverse-frequency class weighting.
  • num_outer_folds, num_outer_folds_repeats, num_inner_folds, num_inner_folds_repeats, max_components, reshuffle_outer_folds, rng, verbose: forwarded to nestedcv.

Returns

  • outer_fold_scores: one DA score per outer repeat.
  • optimal_num_latent_variables: one selected component count per outer repeat.

See also CPPLS.cv_classification, outlierscan, nestedcv, sampleclasses, permda

julia> using Random;

julia> X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.2; 0.2 0.4; 0.4 0.3;
            2.0 2.0; 2.1 2.2; 2.2 2.1; 2.3 2.2; 2.2 2.4; 2.4 2.3];

julia> classes = repeat(["A", "B"], inner=6);

julia> Y, responselabels = onehot(classes);

julia> spec = CPPLSModel(ncomponents=1, gamma=0.5, mode=:discriminant);

julia> scores, best_k = cvda(
           X,
           Y;
           spec=spec,
           fit_kwargs=(; responselabels=responselabels),
           num_outer_folds=3,
           num_outer_folds_repeats=3,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> length(scores)
3

julia> best_k == fill(1, 3)
true
source
cvda(
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractCategoricalArray;
    kwargs...
)

Convert categorical class labels to one-hot form and forward to cvda. If fit_kwargs.responselabels is absent, the observed class labels are injected automatically.

source
cvda(
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractVector;
    kwargs...
)

Convert non-numeric class labels to one-hot form and forward to cvda. Numeric vectors are rejected because they are ambiguous with regression targets.

source
CPPLS.cvregFunction
cvreg(
    X::AbstractMatrix{<:Real},
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true,
)

Run nested cross-validation for CPPLS regression with the standard regression defaults wired in automatically. This wrapper is a convenience layer over nestedcv that fixes the regression callbacks returned by CPPLS.cv_regression().

The positional arguments X and Y, and the keyword arguments spec, fit_kwargs, num_outer_folds, num_outer_folds_repeats, num_inner_folds, num_inner_folds_repeats, max_components, reshuffle_outer_folds, rng, and verbose have the same meaning as in nestedcv.

Arguments

  • X: predictor matrix with one row per sample.
  • Y: response matrix with one row per sample.

Keyword arguments

  • spec: CPPLS specification. spec.mode must be :regression.
  • fit_kwargs: additional keyword arguments forwarded to fit.
  • num_outer_folds, num_outer_folds_repeats, num_inner_folds, num_inner_folds_repeats, max_components, reshuffle_outer_folds, rng, verbose: forwarded to nestedcv.

Returns

  • outer_fold_scores: one regression score per outer repeat.
  • optimal_num_latent_variables: one selected component count per outer repeat.

Use nestedcv directly when you need custom callbacks, fold-local observation weighting, stratification, or any other lower-level control that is not exposed by this wrapper.

See also CPPLS.cv_regression, nestedcv, permreg

julia> using Random;

julia> X = reshape(collect(1.0:16.0), :, 1);

julia> Y = reshape(2 .* vec(X) .+ 1, :, 1);

julia> spec = CPPLSModel(ncomponents=1, gamma=0.5, mode=:regression);

julia> scores, best_k = cvreg(
           X,
           Y;
           spec=spec,
           num_outer_folds=2,
           num_outer_folds_repeats=2,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> length(scores)
2

julia> best_k == fill(1, 2)
true
source
cvreg(
    X::AbstractMatrix{<:Real},
    y::AbstractVector{<:Real};
    kwargs...
)

Reshape a univariate response vector into a single-column matrix and forward to cvreg.

source
CPPLS.outlierscanFunction
outlierscan(
    X::AbstractMatrix{<:Real}, 
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    obs_weight_fn::Union{Function, Nothing}=nothing,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=10 * num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    reshuffle_outer_folds::Bool=true,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true
)

Run repeated nested cross-validation for discriminant analysis and count how often each sample is tested and misclassified. This is a diagnostic companion to nestedcv, not a replacement for it: the goal is to identify samples that repeatedly fail when held out, which can be useful when screening for mislabeled, contaminated, atypical, or otherwise problematic observations.

Unlike nestedcv, this routine fixes the cross-validation callbacks internally by using the bundle returned by CPPLS.cv_classification(). On each outer split it selects the number of latent variables by repeated inner cross-validation, predicts the held-out samples, and records which of those samples were misclassified. Across repeats, each sample accumulates a test count and a misclassification count.

This method expects discriminant-analysis settings, so spec.mode must be :discriminant and Y must be a one-hot response matrix. Stratification is derived automatically from Y via sampleclasses(Y).

Arguments

  • X: predictor matrix with one row per sample.
  • Y: one-hot response matrix with one row per sample and one column per class.

Keyword arguments

  • spec: CPPLS model specification used for every fit. During inner optimization, the routine evaluates component counts 1:max_components by replacing spec.ncomponents on temporary copies of spec.
  • fit_kwargs: additional keyword arguments forwarded to fit. Entries tied to the sample axis, namely obs_weights, samplelabels, sampleclasses, Yaux, and Y_auxiliary, are subset automatically to the current training split. If responselabels are not supplied, they are inferred from the number of columns in Y. Response-column metadata such as response_weights and target_weights are passed through unchanged.
  • obs_weight_fn: callback for fold-local observation weights. By default this is default_da_obs_weight_fn, which applies invfreqweights(sampleclasses(Y_train)) inside each outer or inner training split, matching cvda and permda. It is called as obs_weight_fn(X_train, Y_train; sample_indices=..., fit_kwargs=..., spec=...) and must return either nothing or an AbstractVector of nonnegative finite weights of length size(X_train, 1). These weights are multiplied elementwise with any fixed obs_weights present in fit_kwargs. Pass obs_weight_fn=nothing to disable fold-local weighting.
  • num_outer_folds: number of folds in each outer partition.
  • num_outer_folds_repeats: number of outer-fold evaluations to run. With the default reshuffle_outer_folds=true, new outer partitions are drawn between repeats so a sample can be tested multiple times across different train/test splits.
  • num_inner_folds: number of folds used inside each outer training split to choose the number of latent variables.
  • num_inner_folds_repeats: number of inner folds evaluated per outer split. This cannot exceed num_inner_folds.
  • max_components: largest latent-variable count considered in the inner loop. This search limit is independent of spec.ncomponents; if it is larger, temporary copies of spec are used with the required component count.
  • reshuffle_outer_folds: if true, regenerate the outer folds on each repeat; if false, build one outer partition and reuse its folds. Since outlier scanning is usually meant to probe sample stability across many different holdout sets, the default is true.
  • rng: random-number generator used for fold construction and reshuffling.
  • verbose: if true, print progress for the outer and inner folds.

Returns

  • n_tested: integer vector whose ith entry counts how often sample i appeared in an outer test set.
  • n_flagged: integer vector whose ith entry counts how often sample i was misclassified when it appeared in an outer test set.
  • rate: vector defined as n_flagged ./ max.(1, n_tested). Larger values indicate samples that are more frequently flagged when held out. A sample that was never tested receives rate 0.0.

See also CPPLSModel, fit, CPPLS.cv_classification, invfreqweights, onehot, nestedcv, sampleclasses

julia> using Random;

julia> X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.2; 0.2 0.4; 0.4 0.3;
            2.0 2.0; 2.1 2.2; 2.2 2.1; 2.3 2.2; 2.2 2.4; 2.4 2.3];

julia> classes = repeat(["A", "B"], inner=6);

julia> Y, responselabels = onehot(classes);

julia> spec = CPPLSModel(ncomponents=1, gamma=0.5, mode=:discriminant);

julia> out = outlierscan(
           X,
           Y;
           spec=spec,
           fit_kwargs=(; responselabels=responselabels),
           num_outer_folds=3,
           num_outer_folds_repeats=3,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> sum(out.n_tested)
12

julia> all(out.n_flagged .≤ out.n_tested)
true

julia> all(0.0 .≤ out.rate .≤ 1.0)
true
source
outlierscan(
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractCategoricalArray; 
    kwargs...
)

Convert categorical sample classes to one-hot form and forward to outlierscan.

source
outlierscan(
    X::AbstractMatrix{<:Real}, 
    sampleclasses::AbstractVector; 
    kwargs...
)

Convert non-numeric sample classes to one-hot form and forward to outlierscan. Numeric vectors are rejected because they are ambiguous with regression targets.

source
CPPLS.nestedcvFunction
nestedcv(
    X::AbstractMatrix{<:Real}, 
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    obs_weight_fn::Union{Function, Nothing}=default_da_obs_weight_fn,
    score_fn::Function,
    predict_fn::Function,
    select_fn::Function,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    strata::Union{AbstractVector{<:Integer}, Nothing}=nothing,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true
)

Run explicit nested cross-validation for CPPLS. The caller supplies score_fn, predict_fn, and select_fn, so the routine can be used for either regression or classification. spec and fit_kwargs control model fitting, while the return value is (outer_fold_scores, optimal_num_latent_variables).

When provided, obs_weight_fn is called on each training split as obs_weight_fn(X_train, Y_train; sample_indices=..., fit_kwargs=..., spec=...). The callback must return fold-local observation weights matching the current training-set size, or nothing. Any fold-local weights returned by the callback are combined elementwise with fixed obs_weights supplied through fit_kwargs.

For standard use, score_fn, predict_fn, and select_fn can be obtained from CPPLS.cv_classification() or CPPLS.cv_regression(). In contrast, obs_weight_fn is an optional user-supplied callback when fold-specific observation weighting is needed.

When spec.mode == :discriminant, default responselabels are injected if they are not already present in fit_kwargs.

Arguments

  • X: predictor matrix with one row per sample.
  • Y: response matrix with one row per sample. For classification this is typically a one-hot matrix; for regression it contains continuous responses.

Keyword arguments

  • spec: CPPLS model specification used for every fit. During inner optimization, the routine evaluates component counts 1:max_components by replacing spec.ncomponents on temporary copies of spec.
  • fit_kwargs: additional keyword arguments forwarded to fit. Entries tied to the sample axis, namely obs_weights, samplelabels, sampleclasses, Yaux, and Y_auxiliary, are subset automatically to the current training split. Response-column metadata such as response_weights and target_weights are passed through unchanged.
  • obs_weight_fn: optional callback for fold-local observation weights. It receives X_train and Y_train for the current training split, and may also inspect sample_indices, fit_kwargs, and spec through keyword arguments. It must return either nothing or an AbstractVector of nonnegative finite weights of length size(X_train, 1). These weights are multiplied elementwise with any fixed obs_weights present in fit_kwargs.
  • score_fn: scoring callback with signature score_fn(Y_true, Y_pred) -> Real. It is called on held-out responses and the predictions produced by predict_fn for a single candidate component count. The returned scalar is the objective optimized in inner cross-validation and reported on the outer folds.
  • predict_fn: prediction callback with signature predict_fn(model, X_holdout, k). It receives a fitted CPPLS model, the held-out predictor rows to evaluate, and the number of latent variables k. predict_fn must return predictions in a representation and shape that score_fn can compare directly with Y_true.
  • select_fn: model-selection callback with signature select_fn(scores) -> Int. It receives the vector of inner-fold scores for k = 1:max_components and must return a 1-based component count in that range. Its optimization direction must match score_fn, for example argmax for larger-is-better scores or argmin for smaller-is-better losses.
  • num_outer_folds: number of folds in the outer assessment loop.
  • num_outer_folds_repeats: number of outer-fold evaluations to run. When reshuffle_outer_folds=false, this usually matches num_outer_folds so each fold in one fixed outer partition is evaluated once. Larger values require reshuffle_outer_folds=true, in which case new outer partitions are drawn between repeats.
  • num_inner_folds: number of folds used inside each outer training split to choose the number of latent variables.
  • num_inner_folds_repeats: number of inner folds evaluated per outer split. This cannot exceed num_inner_folds.
  • max_components: largest latent-variable count considered in the inner loop. This value defines the search range 1:max_components independently of the ncomponents stored in spec. If max_components > spec.ncomponents, the inner loop and the final outer-fold fit still evaluate those larger component counts by working on temporary copies of spec with ncomponents replaced accordingly.
  • strata: optional integer labels used to build stratified folds. When omitted, folds are created by shuffling sample indices without stratification.
  • reshuffle_outer_folds: if false, build one outer partition and reuse its folds across repeats, which is the standard nested cross-validation setup. If true, draw a new outer partition on each repeat, giving repeated nested cross-validation. Repeated random outer splits are especially useful for diagnostics such as outlierscan, but can also be used in ordinary nested CV. For permutation testing, however, the observed run and all permuted runs must use the same reshuffle_outer_folds setting and the same score aggregation; otherwise the resulting p-value is not comparable.
  • rng: random-number generator used for fold construction and any reshuffling.
  • verbose: if true, print progress for outer and inner folds.
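The callback trio above can also be written by hand when the stock selectors do not fit. The sketch below shows a smaller-is-better loss (RMSE) with a matching selector; these are illustrative stand-ins for CPPLS.cv_regression(), and the call to predict inside my_predict_fn assumes the model/holdout/component-count form documented for predict_fn.

```julia
# Custom callback trio for a smaller-is-better loss, matching the
# signatures documented above.
rmse_score_fn(Y_true, Y_pred) =
    sqrt(sum(abs2, Y_true .- Y_pred) / length(Y_true))

my_predict_fn(model, X_holdout, k) = predict(model, X_holdout, k)

# argmin matches the direction of rmse_score_fn and, on ties, returns
# the first (smallest) component count.
my_select_fn(scores) = argmin(scores)
```

Keeping the optimization direction of select_fn consistent with score_fn is the one invariant this interface does not check for you.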

Returns

  • outer_fold_scores: one scalar score per outer repeat, obtained by fitting the model on the corresponding outer training split with the selected number of components and evaluating it with score_fn on the outer test split.
  • optimal_num_latent_variables: one selected component count per outer repeat. Each value comes from repeated inner cross-validation on that outer training split.
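The two return vectors are typically summarized together: the mean outer score is the headline performance estimate, and the spread of the selected component counts indicates how stable model selection was across outer splits. The values below are hypothetical stand-ins, not outputs of any example in this document.

```julia
using Statistics

# Hypothetical nestedcv outputs: one score and one selected component
# count per outer repeat.
outer_fold_scores = [0.92, 0.88, 0.95]
optimal_num_latent_variables = [2, 2, 3]

mean(outer_fold_scores)                 # headline performance estimate
extrema(optimal_num_latent_variables)   # stability of model selection
```

A wide range in optimal_num_latent_variables is a warning sign that the chosen model complexity depends strongly on which samples happen to be in the training split.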

See also CPPLSModel, fit, pvalue, CPPLS.cv_classification, outlierscan, CPPLS.cv_regression, invfreqweights, onehot, nestedcvperm, sampleclasses

julia> using Random;

julia> X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.2; 0.2 0.4; 0.4 0.3;
            2.0 2.0; 2.1 2.2; 2.2 2.1; 2.3 2.2; 2.2 2.4; 2.4 2.3];

julia> classes = repeat(["A", "B"], inner=6)
12-element Vector{String}:
 "A"
 "A"
 "A"
 "A"
 "A"
 "A"
 "B"
 "B"
 "B"
 "B"
 "B"
 "B"

julia> Y, responselabels = onehot(classes)
([1 0; 1 0; … ; 0 1; 0 1], ["A", "B"])

julia> cb = CPPLS.cv_classification();

julia> m = CPPLSModel(ncomponents=1, gamma=0.5, center_X=true, mode=:discriminant);

julia> obs_weight_fn = (X_train, Y_train; kwargs...) -> invfreqweights(sampleclasses(Y_train));

julia> scores, best_k = nestedcv(
           X,
           Y;
           spec=m,
           fit_kwargs=(; responselabels=responselabels),
           obs_weight_fn=obs_weight_fn,
           score_fn=cb.score_fn,
           predict_fn=cb.predict_fn,
           select_fn=cb.select_fn,
           num_outer_folds=3,
           num_outer_folds_repeats=3,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           strata=sampleclasses(Y),
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> length(scores)
3

julia> all(==(1.0), scores)
true

julia> best_k == fill(1, 3)
true
source
CPPLS.nestedcvpermFunction
nestedcvperm(
    X::AbstractMatrix{<:Real}, 
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    obs_weight_fn::Union{Function, Nothing}=nothing,
    score_fn::Function,
    predict_fn::Function,
    select_fn::Function,
    num_permutations::Integer=999,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    strata::Union{AbstractVector{<:Integer}, Nothing}=nothing,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true
)

Run a permutation test around nestedcv by repeatedly permuting the rows of Y and recomputing the nested cross-validation score. The result is a vector of mean scores, one for each permutation. Each permutation reruns the full nested-CV pipeline, so the null distribution reflects outer-fold assessment, inner-loop model selection, and any fold-local observation weighting supplied through obs_weight_fn.

This function takes the same core callbacks as nestedcv: score_fn, predict_fn, and select_fn can be obtained from CPPLS.cv_classification() or CPPLS.cv_regression(), while obs_weight_fn remains an optional user-supplied callback.

The positional arguments X and Y and the shared keyword arguments (spec, fit_kwargs, obs_weight_fn, the callback trio, the fold controls, max_components, reshuffle_outer_folds, rng, and verbose) have the same meaning as in nestedcv. In addition, num_permutations controls how many shuffled-response runs are performed.

When strata are provided, the same row permutation applied to Y is also applied to strata before nested CV is rerun, so fold construction remains aligned with the permuted responses.

Returns

  • permutation_scores: vector whose ith entry is mean(scores) from the ith call to nestedcv on permuted responses.

To compare these scores with an observed score using pvalue, the observed analysis must use the same score_fn, predict_fn, select_fn, fold settings, reshuffle_outer_folds choice, and score aggregation.
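As a concrete reading of this requirement, the comparison can be written out by hand. This is a sketch for a larger-is-better score; the package's pvalue helper performs this comparison, but its exact signature is not restated here. `outer_fold_scores` is assumed to come from the observed (unpermuted) nestedcv run and `permutation_scores` from nestedcvperm, both obtained with identical settings.

```julia
using Statistics

# Aggregate the observed run exactly as nestedcvperm aggregates each
# permuted run (mean over outer folds).
observed = mean(outer_fold_scores)

# Standard permutation p-value: how often a permuted run matches or
# beats the observed aggregate. The +1 terms count the observed run
# itself as one member of the null distribution.
p = (1 + count(>=(observed), permutation_scores)) /
    (1 + length(permutation_scores))
```

With num_permutations=999, the smallest attainable p-value is 1/1000, which is why the default number of permutations is 999 rather than a round 1000.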

See also CPPLSModel, fit, pvalue, CPPLS.cv_classification, outlierscan, CPPLS.cv_regression, invfreqweights, onehot, nestedcv, sampleclasses

julia> using Random;

julia> X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.2; 0.2 0.4; 0.4 0.3;
            2.0 2.0; 2.1 2.2; 2.2 2.1; 2.3 2.2; 2.2 2.4; 2.4 2.3];

julia> classes = repeat(["A", "B"], inner=6)
12-element Vector{String}:
 "A"
 "A"
 "A"
 "A"
 "A"
 "A"
 "B"
 "B"
 "B"
 "B"
 "B"
 "B"

julia> Y, responselabels = onehot(classes);

julia> cb = CPPLS.cv_classification();

julia> m = CPPLSModel(ncomponents=1, gamma=0.5, center_X=true, mode=:discriminant);

julia> obs_weight_fn = (X_train, Y_train; kwargs...) -> invfreqweights(sampleclasses(Y_train));

julia> permutation_scores = nestedcvperm(
                     X,
                     Y;
                     spec=m,
                     fit_kwargs=(; responselabels=responselabels),
                     obs_weight_fn=obs_weight_fn,
                     score_fn=cb.score_fn,
                     predict_fn=cb.predict_fn,
                     select_fn=cb.select_fn,
                     num_permutations=3,
                     num_outer_folds=3,
                     num_outer_folds_repeats=3,
                     num_inner_folds=2,
                     num_inner_folds_repeats=2,
                     max_components=1,
                     strata=sampleclasses(Y),
                     rng=MersenneTwister(1),
                     verbose=false,
             );

julia> length(permutation_scores)
3

julia> all(0.0 .≤ permutation_scores .≤ 1.0)
true
source
CPPLS.permdaFunction
permda(
    X::AbstractMatrix{<:Real},
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    weighted::Bool=true,
    num_permutations::Integer=999,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true,
)

Run permutation-based nested cross-validation for CPPLS discriminant analysis with the same DA defaults as cvda. Internally, permda uses the same classification callbacks, the same fold-local inverse-frequency weighting rule, and the same class-based stratification as cvda, but places that workflow inside nestedcvperm.

The positional arguments X and Y, and the shared keyword arguments spec, fit_kwargs, weighted, num_outer_folds, num_outer_folds_repeats, num_inner_folds, num_inner_folds_repeats, max_components, reshuffle_outer_folds, rng, and verbose have the same meaning as in cvda.

Arguments

  • X: predictor matrix with one row per sample.
  • Y: one-hot response matrix with one row per sample and one column per class.

Additional keyword arguments

  • num_permutations: number of response permutations evaluated by nestedcvperm.

Returns

  • permutation_scores: vector whose entries are the mean outer-fold DA scores from each permutation run.

Use nestedcvperm directly when you need custom callbacks, custom fold weighting, custom stratification, or any other lower-level control that is not exposed by this wrapper.

See also pvalue, cvda, nestedcv, nestedcvperm, sampleclasses

julia> using Random;

julia> X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.2; 0.2 0.4; 0.4 0.3;
            2.0 2.0; 2.1 2.2; 2.2 2.1; 2.3 2.2; 2.2 2.4; 2.4 2.3];

julia> classes = repeat(["A", "B"], inner=6);

julia> spec = CPPLSModel(ncomponents=1, gamma=0.5, mode=:discriminant);

julia> permutation_scores = permda(
           X,
           classes;
           spec=spec,
           num_permutations=3,
           num_outer_folds=3,
           num_outer_folds_repeats=3,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> length(permutation_scores)
3

julia> all(0.0 .≤ permutation_scores .≤ 1.0)
true
source
permda(
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractCategoricalArray;
    kwargs...
)

Convert categorical class labels to one-hot form and forward to permda. If fit_kwargs.responselabels is absent, the observed class labels are injected automatically.

source
permda(
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractVector;
    kwargs...
)

Convert non-numeric class labels to one-hot form and forward to permda. Numeric vectors are rejected because they are ambiguous with regression targets.

source
CPPLS.permregFunction
permreg(
    X::AbstractMatrix{<:Real},
    Y::AbstractMatrix{<:Real};
    spec::CPPLSModel,
    fit_kwargs::NamedTuple=(;),
    num_permutations::Integer=999,
    num_outer_folds::Integer=8,
    num_outer_folds_repeats::Integer=num_outer_folds,
    num_inner_folds::Integer=7,
    num_inner_folds_repeats::Integer=num_inner_folds,
    max_components::Integer=spec.ncomponents,
    reshuffle_outer_folds::Bool=false,
    rng::AbstractRNG=Random.GLOBAL_RNG,
    verbose::Bool=true,
)

Run permutation-based nested cross-validation for CPPLS regression with the same defaults as cvreg. Internally, permreg uses the same regression callbacks as cvreg, but places that workflow inside nestedcvperm.

The positional arguments X and Y, and the shared keyword arguments spec, fit_kwargs, num_outer_folds, num_outer_folds_repeats, num_inner_folds, num_inner_folds_repeats, max_components, reshuffle_outer_folds, rng, and verbose have the same meaning as in cvreg.

Arguments

  • X: predictor matrix with one row per sample.
  • Y: response matrix with one row per sample.

Additional keyword arguments

  • num_permutations: number of response permutations evaluated by nestedcvperm.

Returns

  • permutation_scores: vector whose entries are the mean outer-fold regression scores from each permutation run.

Use nestedcvperm directly when you need custom callbacks, fold-local observation weighting, stratification, or any other lower-level control that is not exposed by this wrapper.

See also pvalue, cvreg, nestedcvperm, predict

julia> using Random;

julia> X = reshape(collect(1.0:16.0), :, 1);

julia> y = 2 .* vec(X) .+ 1;

julia> spec = CPPLSModel(ncomponents=1, gamma=0.5, mode=:regression);

julia> permutation_scores = permreg(
           X,
           y;
           spec=spec,
           num_permutations=2,
           num_outer_folds=2,
           num_outer_folds_repeats=2,
           num_inner_folds=2,
           num_inner_folds_repeats=2,
           max_components=1,
           rng=MersenneTwister(1),
           verbose=false,
       );

julia> length(permutation_scores)
2
source
permreg(
    X::AbstractMatrix{<:Real},
    y::AbstractVector{<:Real};
    kwargs...
)

Reshape a univariate response vector into a single-column matrix and forward to permreg.

source