Fit
Model fitting in CPPLS is performed using StatsAPI.fit together with a CPPLSModel. This unified interface supports both regression and discriminant analysis, providing a consistent workflow for a wide range of supervised modeling tasks.
The distinction between regression and discriminant analysis in CPPLS, as specified by the mode keyword in CPPLSModel, determines which convenience functions are available for downstream analysis. For model fitting itself, the essential difference is that discriminant analysis (DA) uses a one-hot encoded $Y$ matrix as the response, whereas regression typically uses a $Y$ vector or matrix with continuously varying values.
CPPLS is flexible, however: the response matrix $Y$ may contain both one-hot encoded columns (for classification or DA) and continuous columns (for regression) at the same time. This allows hybrid models in which predictor variables are aligned with multiple response variables of different types. In such cases, users must encode the $Y$ matrix appropriately and extract the relevant outputs from project and predict.
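As a sketch of how such a hybrid response block might be assembled by hand (the sample data and column layout below are illustrative, not a CPPLS requirement):

```julia
# Build a hybrid response matrix: one-hot class columns plus a continuous column.
classes = ["a", "b", "a", "c", "b"]          # class labels for five samples
levels = sort(unique(classes))               # ["a", "b", "c"]

# One-hot encode: one column per class level, 1.0 where the sample belongs.
Yclass = [c == l ? 1.0 : 0.0 for c in classes, l in levels]

# A continuous response measured on the same samples.
ycont = [0.3, 1.2, 0.7, 2.1, 1.5]

# Hybrid primary response block: classification and regression columns side by side.
Y = hcat(Yclass, ycont)
```

After fitting, the classification-related outputs would be read from the one-hot columns and the regression-related outputs from the continuous column.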
When primary response columns differ strongly in variance or units, it may be sensible to set scale_Yprim=true in CPPLSModel so that they are on a comparable footing, and then, if needed, to control their relative influence explicitly through the response_weights and target_weights keyword arguments of StatsAPI.fit.
You can optionally provide observation weights (keyword argument obs_weights), response weights (keyword argument response_weights), target weights (keyword argument target_weights), and auxiliary response information (keyword argument Yaux) to StatsAPI.fit. Observation weights control the influence of each sample on the model and are especially useful in discriminant analysis when classes are imbalanced. Response weights control the influence of each response column, both primary and auxiliary, on the supervised compression step. Target weights control the influence of each primary response column in the canonical-correlation step. Auxiliary responses guide the supervised projection without becoming prediction targets themselves. Together with the gamma parameter, which balances predictor scale and predictor-response association, these options allow the user to tailor a CPPLS model to the structure of the dataset. Other choices, such as the number of components, may also be important depending on the application.
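To make the class-balancing role of obs_weights concrete, here is a hand-rolled inverse-frequency weighting; the invfreqweights helper in CPPLS presumably computes something along these lines, but this standalone version is only an illustration:

```julia
# Inverse-frequency weights: samples from rare classes receive larger weights,
# so each class contributes comparably to the supervised fit.
function inverse_frequency_weights(classes::AbstractVector)
    counts = Dict{eltype(classes),Int}()
    for c in classes
        counts[c] = get(counts, c, 0) + 1
    end
    return [1.0 / counts[c] for c in classes]
end

classes = ["a", "a", "a", "b"]
w = inverse_frequency_weights(classes)
# Each "a" sample gets weight 1/3 and the lone "b" sample gets weight 1.0,
# so the two classes carry equal total weight.
```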
If you plan to use observation weights, response weights, target weights, or auxiliary responses, these choices should be made before selecting gamma, because all of them can affect the supervised objective and therefore the most appropriate value of gamma.
Quick Start
The same fit entry point is used for both discriminant analysis and regression. The main difference is the structure of Yprim: in DA it is a one-hot representation of the classes, whereas in regression it is a continuous vector or matrix.
For a plain discriminant-analysis fit:
```julia
using CPPLS
using StatsAPI

m = CPPLSModel(ncomponents=2, gamma=0.5, mode=:discriminant)
mf = fit(m, X, classes)
```

For a plain regression fit:
```julia
using CPPLS
using StatsAPI

m = CPPLSModel(ncomponents=2, gamma=0.5, mode=:regression)
mf = fit(m, X, Y)
```

To add class balancing and auxiliary supervision in DA:
```julia
using CPPLS
using StatsAPI

m = CPPLSModel(ncomponents=2, gamma=0.5, mode=:discriminant)
mf = fit(
    m,
    X,
    classes;
    obs_weights=invfreqweights(classes),
    Yaux=Y_aux,
)
```

For complete worked examples, including score plots, gamma selection, and a regression workflow with auxiliary responses, see Fit Examples.
Centering, Scaling, and Response Weighting
CPPLS provides convenience options for centering and scaling, but these options are intentionally asymmetric across $X$, Yprim, and Yaux, because these matrices do not enter the algorithm in the same way.
For the predictor matrix $X$, centering is usually the most important preprocessing step. CPPLS extracts latent components from $X$, and without centering, the model can partly reflect absolute measurement levels rather than variation among samples. Centering therefore makes the score space represent deviations from the average sample rather than deviations from an arbitrary zero point. For this reason, centering of $X$ is the default in CPPLSModel. Scaling $X$ is a separate modeling choice. If predictor variables differ strongly in variance or physical scale, scaling gives them a more equal opportunity to contribute. If large predictor variance is scientifically meaningful, leaving $X$ unscaled allows that information to remain part of the model. In CPPLS this matters directly, because the power parameter gamma mixes predictor scale and predictor-response association when constructing the supervised projection. The default setting of scale_X is therefore false.
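The effect of centering is easy to see directly: subtracting the column means makes every predictor column express deviation from the average sample. The snippet below is plain Julia, independent of CPPLS, which performs this step internally when center_X is true:

```julia
using Statistics

X = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

# Column means of X are (2.0, 20.0); after centering they are zero,
# so scores computed from Xc describe deviations from the average sample
# rather than distances from an arbitrary zero point.
Xc = X .- mean(X, dims=1)
```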
For the primary response block Yprim, scaling can be useful, especially in multivariate regression. When primary response columns differ strongly in variance or units, scaling helps prevent high-variance responses from dominating the target block purely because of magnitude. In discriminant analysis, where Yprim is typically a one-hot matrix, scaling is usually less important, since class imbalance is more naturally addressed through obs_weights. The default setting of scale_Yprim is therefore also false. CPPLS does not expose centering of Yprim as a user option, because the response-guided step of the algorithm works with predictor-response correlations, and those correlations already center response columns internally. A separate centering option for Yprim would therefore be redundant and would give the impression of additional control without materially changing the supervised projection.
The auxiliary response block Yaux is treated differently again. Auxiliary variables guide the construction of the supervised space, but they are not prediction targets. In the CPPLS implementation used here, they enter through predictor-response correlations rather than raw covariances. Because correlation is invariant to affine rescaling apart from sign, ordinary centering and scaling of Yaux do not provide meaningful control over how strongly auxiliary responses influence the model. What matters is the pattern of an auxiliary variable across samples, not its numerical unit. For this reason, centering and scaling options for Yaux are not exposed. If a primary or auxiliary response should have more or less influence on the model, this should be controlled through response_weights and target_weights rather than through preprocessing scale changes.
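The invariance claim can be checked in a couple of lines: an affine rescaling of a response column changes its correlation with a predictor only by sign. This is plain Julia, independent of CPPLS:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]   # a predictor column
y = [2.0, 1.0, 4.0, 3.0]   # an auxiliary response column

# Positive rescaling and shifting leave the correlation unchanged...
same = cor(x, y) ≈ cor(x, 10.0 .* y .+ 5.0)
# ...while a negative scale factor only flips the sign.
flipped = cor(x, y) ≈ -cor(x, -2.0 .* y)
```

This is why rescaling Yaux is not offered as a tuning knob: it cannot change how strongly an auxiliary column steers the projection.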
API
The reference below documents the fit interface itself.
StatsAPI.fit — Function
```julia
fit(m::CPPLSModel,
    X::AbstractMatrix{<:Real},
    Yprim::AbstractMatrix{<:Real};
    kwargs...
)

fit(m::CPPLSModel,
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractCategoricalArray{T,1,R,V,C,U};
    kwargs...
) where {T,R,V,C,U}

fit(m::CPPLSModel,
    X::AbstractMatrix{<:Real},
    sampleclasses::AbstractVector;
    kwargs...
)
```

Fit a CPPLS model using the StatsAPI entry point and an explicit CPPLSModel. The model specification supplies the number of components, the gamma configuration, centering, the analysis mode, and all numerical tolerances, while the call to fit supplies data, optional weights, auxiliary responses, and label metadata.
When Yprim is provided, it is treated as the primary response block. When sampleclasses is provided, the labels are converted to a one-hot response matrix, class names are inferred as response labels, and the fit is forced to discriminant analysis; m.mode must be :discriminant or an ArgumentError is thrown.
The gamma setting in the model may be a fixed scalar, a (lo, hi) tuple, or a vector mixing scalars and tuples. Non-scalar settings trigger per-component selection by choosing the value that yields the largest leading canonical correlation between the supervised projection and the primary responses, and the resulting gamma values are stored in the fitted model. The per-candidate gamma and squared-canonical-correlation values examined during that search are also stored in the fitted model as matrices for downstream diagnostics and plotting. A range such as 0:0.01:1 is treated as a grid of fixed gamma values. To convert such a range into adjacent search intervals, use intervalize(0:0.01:1), which yields [(0.0, 0.01), (0.01, 0.02), ...]. If you want interval-wise Brent searches, pass intervalize(...) to gamma. Tuple intervals are treated as closed intervals: both endpoints are evaluated explicitly, and the final choice is the best among the two endpoints and the interior Brent minimizer.
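The grid-to-intervals conversion described above can be sketched in plain Julia; this hand-rolled version only illustrates the shape of what intervalize is documented to return:

```julia
# Turn a gamma grid into adjacent closed search intervals,
# e.g. 0.0:0.5:1.0 -> [(0.0, 0.5), (0.5, 1.0)].
grid = 0.0:0.5:1.0
intervals = [(grid[i], grid[i + 1]) for i in 1:length(grid) - 1]
```

Each tuple then becomes one closed interval for a Brent search, with both endpoints evaluated explicitly as described above.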
Keyword arguments accepted by fit include obs_weights for per-sample weighting, Yaux for auxiliary response columns, response_weights for weighting the columns of the combined response block used to construct the supervised projection, and target_weights for weighting the primary-response columns in the later CCA alignment step. Optional samplelabels, predictorlabels, responselabels, and sampleclasses metadata are also accepted for diagnostics and plotting. Yaux must have the same number of rows as X and is concatenated to Yprim internally to build the supervised projection, while prediction targets always remain the primary responses. If omitted, both weight vectors default to ones. response_weights must have length size(Yprim, 2) + size(Yaux, 2) when Yaux is provided and size(Yprim, 2) otherwise; target_weights must have length size(Yprim, 2).
The return value is a CPPLSFit containing scores, loadings, regression coefficients, and the metadata needed for downstream prediction and diagnostics. Use CPPLS.fit or StatsAPI.fit when disambiguation is required in your namespace.
See also CPPLSFit, CPPLSModel, coef, fitted, gamma, intervalize, invfreqweights, predictorlabels, responselabels, residuals, sampleclasses, samplelabels, xscores
Examples
julia> using JLD2; file = CPPLS.dataset("synthetic_cppls_da_dataset.jld2");
julia> labels, X, classes, Yaux = load(file, "sample_labels", "X", "classes", "Y_aux");
julia> m = CPPLSModel(ncomponents=2, gamma=0.01:0.01:1.00, mode=:discriminant)
CPPLSModel
ncomponents: 2
gamma: 0.01:0.01:1.0
center_X: true
scale_X: false
scale_Yprim: false
mode: discriminant
julia> cpplsfit = fit(m, X, classes; samplelabels=labels);
julia> size(CPPLS.xscores(cpplsfit))
(100, 2)
julia> m = CPPLSModel(ncomponents=2, gamma=0.75, mode=:discriminant)
CPPLSModel
ncomponents: 2
gamma: 0.75
center_X: true
scale_X: false
scale_Yprim: false
mode: discriminant
julia> cpplsfit = fit(m, X, classes; obs_weights=invfreqweights(classes), Yaux=Yaux)
CPPLSFit
mode: discriminant
samples: 100
predictors: 14
responses: 2
components: 2
gamma: [0.75, 0.75]