9 Measurement Invariance
9.1 Measurement Invariance (Week 9) Overview
Learning Objectives
By the end of this tutorial & workshop, students will be able to:
- Explain the purpose of measurement invariance (MI) and why it is essential for group comparisons
- Understand the statistical foundations of MI within the CFA/SEM framework
- Differentiate between levels of invariance — configural, metric, scalar, and strict
- Fit and compare multi-group CFA models in lavaan
- Evaluate invariance using changes in fit and parameter estimates
- Identify and address violations using partial invariance approaches
- Report MI analyses clearly and transparently
Structure
Section 1: Lecture
- Part 1: Understanding Measurement Invariance — what it is, levels, and logic
Section 2: Code Walkthrough
- Part 2: Testing Measurement Invariance — fitting and comparing models using lavaan
Section 3: Worksheet
- Part 3: Exercises — apply invariance testing to the DLBS data
9.2 Introduction
Scale Development Process (Recap)
You may remember from previous weeks that developing a psychological scale is a multi-stage process that moves from theory to statistical testing and back again. Generally speaking, it should follow these key phases:
Define the construct
Generate items
Collect data
Explore scale structure and remove problematic items (EFA)
Evaluate internal consistency of the factors (reliability analysis)
Collect new data (or split the EFA data and use the remaining portion)
Evaluate the factor structure suggested by EFA using confirmatory factor analysis (CFA)
Evaluate whether the scale measures the same construct in the same way across groups (measurement invariance; MI)
Collect new data (ideal, but rare in practice)
- Examine validity evidence (e.g., convergent/discriminant validity, criterion validity)
Ongoing validity and generalisability checks (also rare in practice)
- Evaluate and refine the scale across new samples, contexts, and time
This week, we’re going to learn how to evaluate whether the scale measures the same construct in the same way across groups using measurement invariance (MI).
9.3 Part 1: Understanding Measurement Invariance
Where This Fits in Scale Development
By this point, you’ve already done the heavy lifting of figuring out what your scale looks like and whether that structure holds up statistically.
In EFA, you asked: “What structure is in the data?”. You let the data speak, explored patterns, and identified a plausible factor structure.
In CFA, you tightened the screws and asked: “Does this structure hold?”. You specified a model based on theory (in this case, EFA) and tested whether it fits in a new dataset.
Now we move one step further and ask: “Does this structure hold equally across different groups?”
Up until now, you’ve only been asking whether your model works in general. Now you’re asking whether it works fairly.
This is measurement invariance — the final boss of the measurement model.
Why Measurement Invariance Matters
Let’s start with the uncomfortable truth:
You cannot meaningfully compare groups unless your measure behaves the same way across them.
That includes all the comparisons people love to make, such as:
Gender comparisons
Cultural comparisons
Year group comparisons
Intervention vs control
All of those hinge on a very strong assumption — usually untested — that the scale is operating equivalently across groups.
If that assumption is wrong, your conclusions are on shaky ground.
Without measurement invariance, differences in observed scores might reflect:
Different interpretations of items
Different response styles
Item bias
Or other measurement artefacts
…and not actual differences in the underlying construct. That is the slightly uncomfortable part.
Uncomfortable, because if we accept that and confront it, it has huge implications for our research.
You can have excellent global fit indices, strong factor loadings, and clean, interpretable factors, but still be comparing apples to something that looks like an apple but is psychologically an orange — that is, it is doing something completely different.
STARS Example
Let’s make this concrete using the STARS scale.
Imagine you want to compare statistics anxiety in two groups:
Group A: Neurotypical students
Group B: Neurodivergent students
You run your analysis and find neurodivergent students score higher.
Most researchers would simply conclude that “neurodivergent students experience more statistics anxiety”. Nice. Clean. Publishable.
But… let’s slow that down for a second.
What if:
Some items are interpreted differently across groups (e.g., “interpreting a table” might tap different cognitive processes depending on how someone processes information)?
Certain items are more cognitively demanding for one group, independent of anxiety?
One group is more likely to use higher response categories, even at the same underlying level of anxiety?
Now ask yourself: what is that higher score actually capturing?
It might not be more anxiety. It might be:
differences in processing style
differences in how items are understood
or systematic differences in how people respond to the scale
In other words, you are not necessarily comparing anxiety — you may be comparing how the measurement behaves across groups.
That’s the uncomfortable bit.
Without measurement invariance, you can’t tell whether observed group differences reflect real differences in the latent construct or differences in how the construct is being measured.
Measurement invariance is what lets you separate those two.
Without it, your interpretation is — at best — ambiguous, and at worst, just wrong.
The Core Idea
If we strip everything back, measurement invariance is asking:
Can we use the same measurement model to explain relationships between variables in different groups?
Because if we can’t, then:
the latent construct may not be measured in the same way
the scale may not operate equivalently
and group comparisons may be questionable
Exactly which of those are problems depends on the level of invariance.
Levels of Invariance
Measurement invariance isn’t tested in one go.
Instead, we take a stepwise approach, gradually increasing how strict we are about equality across groups.
Each step asks a slightly stronger question than the last.
Think of it like turning up the pressure on your model and seeing when it starts to crack.
1. Configural Invariance (Baseline Model)
At this stage, we are being deliberately relaxed.
Same factor structure across groups
No equality constraints
Everything (loadings, intercepts, residuals) is freely estimated
What are we testing?
Do the groups conceptualise the construct in the same basic way?
In practical terms:
Do the same items cluster together?
Does the same factor structure make sense in each group?
We are not saying anything about how strongly items relate to the factor yet — just that the overall pattern is the same.
If this fails, stop. Because if different groups don’t even share the same conceptual structure, then you are not measuring the same construct and everything else becomes meaningless.
2. Metric Invariance (Weak Invariance)
Now we start tightening things.
- Factor loadings are constrained to be equal across groups
What are we testing?
Do items relate to the latent construct in the same way across groups?
This is about scale units.
If an item is a strong indicator of the factor in one group, is it equally strong in another?
If metric invariance holds:
A one-unit change in the latent construct means the same thing across groups.
That unlocks something important:
✔ You can compare relationships involving the construct (e.g., correlations, regressions, path models)
Why? Because the scale of the latent variable is now comparable.
Without metric invariance, even a simple correlation difference could just reflect different measurement properties.
3. Scalar Invariance (Strong Invariance)
Now we add another layer of constraint:
- Factor loadings and intercepts (or thresholds for ordinal data) are equal
What are we testing?
Do groups interpret the response scale in the same way?
This is about scale origin.
Even if two people have the same level of the latent trait, do they:
endorse the same response category?
start from the same baseline?
If scalar invariance holds:
The “zero point” (or baseline) of the scale is aligned across groups.
This is the big one.
✔ You can now compare latent means
And here’s the slightly uncomfortable truth:
This is what most people think they’re doing when they compare group means — but usually without testing it.
Without scalar invariance, mean differences are ambiguous. They could reflect:
true differences in the construct
or systematic differences in how groups respond to items
4. Strict Invariance
Now we go all in:
- Loadings + intercepts + residual variances are equal
What are we testing?
Do items have the same measurement precision across groups?
Residuals capture:
measurement error
item-specific variance not explained by the factor
If strict invariance holds:
The scale is equally reliable across groups.
This allows:
✔ Comparison of observed scores (not just latent variables)
Reality Check
Strict invariance is:
Rarely achieved
Rarely necessary
If you demanded strict invariance for every study, you’d publish approximately nothing.
Most of the time:
Scalar invariance is enough for meaningful group comparisons
Everything beyond that is a bonus, not a requirement
Furthermore, when working with ordinal data using WLSMV, strict invariance is typically not tested. This is because residual variances are not freely estimated in the same way as in continuous models, meaning there are no additional parameters to constrain across groups. As a result, scalar invariance is usually considered the highest level of invariance assessed.
Partial Invariance
Full invariance is the textbook ideal. Partial invariance is the reality.
Partial invariance means that some parameters differ across groups, but enough are equal to still support meaningful comparisons.
For example:
Most loadings are equal, but one item behaves differently
Most thresholds are equal, but one item is systematically endorsed differently
In practice, you:
Identify non-invariant parameters
Relax those specific constraints
Re-fit the model
The key idea:
You don’t throw the whole model out because of one problematic item.
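In lavaan, relaxing specific constraints is handled by the group.partial argument. Here’s a minimal sketch, using placeholder names (my_model, my_data, a grouping variable called "group") — the flagged loading is purely illustrative:

# Fit the metric model (all loadings constrained equal across groups)
fit_metric <- lavaan::cfa(my_model, data = my_data, group = "group",
                          group.equal = "loadings")

# Score tests suggest which equality constraints are straining the model
lavaan::lavTestScore(fit_metric)

# Re-fit, freeing only the flagged loading in both groups
fit_partial <- lavaan::cfa(my_model, data = my_data, group = "group",
                           group.equal = "loadings",
                           group.partial = "f1 =~ item5")  # hypothetical non-invariant loading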
How This Links to CFA
This is not a new model. This is just CFA, repeated across groups, with constraints added.
So, everything you learned in the CFA section — especially:
model specification
covariance reproduction
parameter interpretation
…still applies here.
Measurement invariance just forces you to confront a harder question:
Not just “Does the model fit?” but “Does the model fit in the same way across groups?”
Statistical Foundations
You’ve already met the core idea behind CFA.
To recap:
We are trying to explain the covariance matrix of the observed variables using a smaller number of latent factors.
In other words, CFA is not modelling raw scores — it is modelling relationships between variables.
Formally, this is done using the common factor model, which — as we’ve seen previously — generates the model-implied covariance matrix by combining the matrix of factor loadings (how items relate to factors), the covariance matrix of the latent factors (how factors relate to each other), and residual variance.
The model-implied covariance matrix is:
\[ \boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Phi} \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta} \]
Where:
\(\boldsymbol{\Sigma}\) = model-implied covariance matrix (what the model predicts)
\(\boldsymbol{\Lambda}\) = matrix of factor loadings (how strongly items relate to factors)
\(\boldsymbol{\Phi}\) = covariance matrix of the latent factors
\(\boldsymbol{\Theta}\) = residual (error/unique variance) covariance matrix
In other words, it is asking:
Can this combination of loadings, factor relationships, and residuals reproduce the covariance structure we observe in the data?
If yes → the model fits (well enough).
If no → something about the model (or theory) is off.
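If you want to see that covariance-reproduction logic for yourself, here’s a minimal sketch (assuming a fitted lavaan object called fit, as per CFA week):

# Model-implied covariance matrix (what the model predicts)
lavaan::lavInspect(fit, "implied")$cov

# Observed sample covariance matrix (what the data show)
lavaan::lavInspect(fit, "sampstat")$cov

# Residuals: the part of the covariance structure the model fails to reproduce
lavaan::lavResiduals(fit)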
Measurement Invariance Extends This Idea
Measurement invariance takes this exact logic and raises the stakes.
Instead of asking whether one model fits one dataset, we now ask:
Can the same model operate equivalently across multiple groups?
This is a subtle but crucial shift. We are no longer just interested in fit. We are interested in comparability.
The Multi-Group Extension
In measurement invariance, we estimate the model separately for each group, which gives us a group-specific covariance structure.
That is, for each group, we model the observed covariance matrix using that group’s matrix of factor loadings (how items relate to factors), the covariance matrix of the latent factors (how the factors relate to each other), and a residual covariance matrix capturing variance the model does not explain.
Every group is allowed to have its own CFA model.
\[ \boldsymbol{\Sigma}_g = \boldsymbol{\Lambda}_g \boldsymbol{\Phi}_g \boldsymbol{\Lambda}_g^{\top} + \boldsymbol{\Theta}_g \]
Where:
\(g\) = group index (e.g., neurotypical vs neurodivergent students)
\(\boldsymbol{\Sigma}_g\) = observed covariance matrix for group \(g\)
\(\boldsymbol{\Lambda}_g\) = matrix of factor loadings for group \(g\)
\(\boldsymbol{\Phi}_g\) = covariance matrix of the latent factors for group \(g\)
\(\boldsymbol{\Theta}_g\) = residual covariance matrix for group \(g\)
However, if all parameters are freely estimated across groups, we are not testing invariance — we are just fitting separate CFAs.
In this situation, each group has its own set of factor loadings, intercepts, and residuals, meaning the model is allowed to operate differently in each group. Although the same overall structure is specified, the actual parameter values can vary freely.
As a result, all we’d learn is that the model fits reasonably well within each group on its own, and that is not the same as showing the model is equivalent across groups…
What Measurement Invariance Actually Tests
To test measurement invariance, we impose equality constraints on key parts of the model and examine whether the model still fits the data well.
This is what allows us to determine whether the measurement model is operating in the same way across groups, rather than simply fitting each group independently.
Instead of letting everything vary freely, we start asking a more demanding question:
What happens if we force certain aspects of the model to be the same across groups?
For example:
Do items relate to the construct in the same way across groups?
Do groups interpret the response scale in the same way?
Do items have the same amount of measurement error?
Each of these corresponds to a different part of the model that we can choose to constrain to be equal.
The logic is simple but powerful:
If adding these constraints does not substantially worsen model fit → the model is operating equivalently across groups
If model fit does worsen → something about the measurement differs across groups
So, step by step, we move from a model where each group is estimated independently, towards models where more and more aspects of the measurement process are shared across groups.
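In lavaan, that stepwise tightening all happens through one argument — group.equal. A minimal sketch, using placeholder names (my_model, my_data, and a grouping variable called "group"):

# Configural: same structure, all parameters free across groups
fit_configural <- lavaan::cfa(my_model, data = my_data, group = "group")

# Metric: factor loadings constrained equal
fit_metric <- lavaan::cfa(my_model, data = my_data, group = "group",
                          group.equal = "loadings")

# Scalar: loadings + intercepts constrained equal
# (with ordinal items, "intercepts" is replaced by "thresholds")
fit_scalar <- lavaan::cfa(my_model, data = my_data, group = "group",
                          group.equal = c("loadings", "intercepts"))

# Strict: loadings + intercepts + residual variances constrained equal
fit_strict <- lavaan::cfa(my_model, data = my_data, group = "group",
                          group.equal = c("loadings", "intercepts", "residuals"))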
9.4 The Key Question
At its core, measurement invariance comes down to a practical decision:
What needs to be different across groups, and what can we reasonably assume is the same?
In a multi-group CFA, we are fitting the same overall model to multiple groups. The crucial choice is:
Do we allow each group to have its own version of the model?
Or do we require parts of the model to be shared across groups?
More specifically:
Do we need different factor loadings (do items relate to the construct in the same way)?
Do we need different intercepts or thresholds (do groups use the response scale in the same way)?
Do we need different residual variances (is measurement error the same)?
Or:
Can we constrain these to be equal without breaking the model?
If we can, that tells us something important:
The scale is operating equivalently across groups.
We start with a model where everything is freely estimated across groups:
\[ \boldsymbol{\Sigma}_g = \boldsymbol{\Lambda}_g \boldsymbol{\Phi}_g \boldsymbol{\Lambda}_g^{\top} + \boldsymbol{\Theta}_g \]
Here, each group has its own:
factor loadings (\(\boldsymbol{\Lambda}_g\))
covariance matrix of the latent factors (\(\boldsymbol{\Phi}_g\))
residual covariance matrix (\(\boldsymbol{\Theta}_g\))
This is just separate CFAs for each group.
Adding Constraints (Example: Metric Invariance)
We then begin constraining parameters across groups.
For example, if we constrain the factor loadings to be equal:
\[ \boldsymbol{\Sigma}_g = \boldsymbol{\Lambda} \boldsymbol{\Phi}_g \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}_g \]
Now:
Factor loadings are the same across groups (\(\boldsymbol{\Lambda}\))
Factor covariances (\(\boldsymbol{\Phi}_g\)) and residuals (\(\boldsymbol{\Theta}_g\)) can still vary
Note: The model equation itself does not fundamentally change across levels of invariance. What changes is whether parameters carry a group-specific subscript (e.g., \(\boldsymbol{\Lambda}_g\)), meaning they are freely estimated in each group, or no subscript (e.g., \(\boldsymbol{\Lambda}\)), meaning they are constrained to be equal across groups. In other words, we are always fitting the same model — we are simply restricting how similar it must be across groups.
Going Further (Example: Scalar Invariance)
If we also constrain intercepts (not shown in covariance form), we further restrict how the model operates across groups.
At each step:
We are testing whether the same parameter values can explain the covariance structure in all groups.
The Logic of Testing
Measurement invariance is tested by comparing:
A model where parameters are free across groups
A model where some parameters are constrained to be equal
If model fit does not meaningfully worsen:
The constrained parameters are invariant across groups.
We fit models sequentially:
Configural
Metric
Scalar
And compare them.
We’re not asking:
“Is the model perfect?”
We’re asking:
“Does adding constraints make things meaningfully worse?”
How Do We Decide?
You could use chi-square difference tests.
But with large samples (like STARS), they will almost always say:
❌ “Everything is significantly worse”
So instead, we look at changes in fit indices:
ΔCFI ≤ .01
ΔRMSEA ≤ .015
ΔSRMR ≤ .030 (metric), ≤ .015 (scalar)
These are rules of thumb — not commandments.
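Here’s a sketch of how you could compute those changes in R, assuming the configural and metric models we fit later in this chapter:

fit_indices <- c("cfi.scaled", "rmsea.scaled", "srmr")

configural <- lavaan::fitMeasures(fit_configural, fit_indices)
metric     <- lavaan::fitMeasures(fit_metric, fit_indices)

# Change in fit when loadings are constrained (configural → metric)
metric - configural

# The chi-square difference test, for completeness
lavaan::lavTestLRT(fit_configural, fit_metric)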
What Changes with Ordinal Data?
Use WLSMV
Intercepts are replaced by thresholds
If thresholds differ, groups use response categories differently
So, the same observed score ≠ same latent trait
Assumptions & Data Requirements
Same as CFA
Correct model
Independent observations
Adequate sample size
Meaningful covariances
Additional for MI
Sufficient N per group
Comparable group distributions
Same response scale functioning
9.5 Part 2: Testing Measurement Invariance in R
Process Overview
This mirrors CFA — just with more constraints.
Step 0: Pre-Analysis Checks
Everything from CFA still applies, plus:
Adequate sample size per group
Reasonable group balance
Same model fits within each group
Step 1: Fit Baseline CFA
- Check baseline CFA model is a good fit (otherwise MI is pointless)
Step 2: Fit Configural Model
Step 3: Add Constraints Sequentially
Metric
Scalar
Strict (optional)
Step 4: Compare Models
Fit index changes
Parameter changes
Substantive interpretation
Step 5: Report
In-text
Tables
The Statistics Anxiety Rating Scale (STARS)
Let’s go through each of these steps using the same STARS data and CFA model from CFA week.
Here’s a reminder of the measurement model (minus stars_help_1):
[Path diagram: the three-factor STARS measurement model — interpret, test, and help — each with its items.]
This time our research question is:
Does the STARS measure the same construct in the same way in the original English vs. the non-English versions?
(IRL, we’d usually want to look at this language-by-language, but we’re collapsing languages for educational reasons.)
Step 0: Pre-Analysis Checks
Before testing measurement invariance, all the usual CFA pre-analysis checks still apply (we won’t revisit them again now though). However, because we are now working with multiple groups, there are a few additional requirements to consider.
Adequate Sample Size Per Group
It’s not enough for the total sample to be large — each group must have sufficient cases to estimate the model reliably. A model that works well overall can break down within smaller subgroups.
There is no universal minimum sample size per group for measurement invariance testing. However, some practical guidelines are useful:
~100–200 per group → Often adequate for simple, well-behaved models (few factors, strong loadings)
~200–500 per group → Safer for most applied CFA/MI work
500+ per group → Preferred for ordinal data with WLSMV or more complex models
That said, these are rules of thumb, not guarantees. See the call-out below for more information.
There is no universal minimum sample size per group for measurement invariance testing. Required sample size depends on:
Model complexity (number of factors, items, and parameters)
Strength of factor loadings (stronger loadings need fewer cases)
Estimator used (e.g., WLSMV typically needs larger samples than Maximum Likelihood)
A better way to think about it is:
Do I have enough data in each group for the model to converge, produce stable estimates, and show acceptable fit?
Rather than relying on fixed cut-offs, you should:
Fit the model separately in each group
Does it converge?
Are the loadings sensible (e.g., not tiny, not > 1)?
Are standard errors reasonable?
Check model fit within each group
- If the model fits well in large groups but poorly in small ones, that’s a red flag for sample size issues.
Watch for warning signs of too-small samples
Non-convergence
Heywood cases (negative variances)
Huge standard errors
Unstable fit indices
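A quick sketch of how you might check those warning signs on a single group’s fitted model (assuming a fitted lavaan object called fit_group):

# Did the optimiser converge?
lavaan::lavInspect(fit_group, "converged")

# Any inadmissible estimates (e.g., negative variances)? TRUE means all clear
lavaan::lavInspect(fit_group, "post.check")

# Scan for implausibly large standard errors
lavaan::parameterEstimates(fit_group) |>
  dplyr::arrange(dplyr::desc(se)) |>
  head()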
Reasonable Group Balance
Extremely unequal group sizes can cause problems for estimation and model comparison. Invariance testing relies on comparing model fit across groups, so if one group dominates the sample, results may be misleading or unstable.
There is no strict rule for acceptable group balance, but a useful way to think about this is in terms of ratios between groups:
Up to ~2:1 → Generally fine
Around 3:1–5:1 → Usually acceptable, but proceed with caution
> 5:1–10:1 → Potentially problematic
Extreme imbalance (e.g., 10:1+) → Likely to distort results
See the call-out below for more information.
What you’re really asking is: “When does imbalance start to meaningfully distort estimation and comparisons?”.
The honest answer is: when it starts breaking things!
Why Does Imbalance Matter?
In multi-group CFA, the model is estimated simultaneously across groups, but:
Larger groups contribute more information
Fit indices and parameter estimates are dominated by the largest group
Smaller groups can become statistically “invisible”
This creates a situation where:
The model can appear to fit well overall, even if it fits poorly in the smaller group.
Even worse, invariance decisions (ΔCFI, ΔRMSEA, etc.) may reflect the large group almost entirely, masking meaningful differences.
How Do You Actually Diagnose a Problem?
Don’t rely on ratios alone — check behaviour:
Fit the model separately in each group
- Does the small group show worse fit or instability?
Compare parameter estimates across groups
Are loadings wildly different in the smaller group?
Are standard errors much larger?
Look for instability in the smaller group
Non-convergence
Large SEs
Odd estimates (e.g., near-zero or inflated loadings)
The Same Model Fits Within Each Group
Before testing invariance, you need evidence that the basic measurement model is plausible in each group separately. This is typically done by fitting the model independently within each group and checking model fit and parameter estimates.
If the model fits poorly in one group, invariance testing becomes meaningless — you are effectively comparing a well-fitting model to a poorly fitting one.
In this situation, lack of invariance may reflect model misspecification in one group, rather than genuine differences in how the construct operates across groups.
In practice, you should check:
Model fit within each group (e.g., CFI/TLI, RMSEA, SRMR)
Factor loadings (are they reasonably strong and in the expected direction?)
Any estimation problems (e.g., non-convergence, large standard errors, negative variances)
If the model does not fit adequately in all groups, you should revise the measurement model before proceeding — not attempt to “fix” the problem through invariance testing.
🤔 Does our data pass the pre-analysis sample size & ratio checks?
Let’s address the group sample size and ratio first.
First, we need to read in the data (I’ve already removed the stars_help_1 item that we previously dropped as it was redundant).
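stars_data <- readr::read_csv(here::here("data/stars_mi_data.csv"))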
Then, to get our group sample size, we need to count how many participants are in each group.
We can do that with a quick count:
stars_data |>
  dplyr::count(language)

# A tibble: 2 × 2
  language        n
  <chr>       <int>
1 English      3649
2 Non-English  3236
Then we need to calculate the ratio.
Essentially, we just need to divide the number in the English language group by the number in the non-English language group.
We can do that with the following code:
stars_data |>
  dplyr::count(language) |>
  dplyr::summarise(
    ratio = n[language == "English"] / n[language == "Non-English"]
  )

# A tibble: 1 × 1
  ratio
  <dbl>
1  1.13
✔️ Our group sample sizes are very large
✔️ Our group sample size ratio is 1.13:1, which is pretty good
🤔 Does our data pass the pre-analysis individual group model fit checks?
To check this, we need to specify the measurement model and then run the CFA in each group using split samples.
(This is very similar to what we did last week and to the next step [configural invariance], so we won’t complete all these steps in the workshop, but the detail is here for future reference and is an excuse to show you some more advanced R code.)
Specify the measurement model:
stars_model <- '
  interpret =~ stars_int_3 + stars_int_11 + stars_int_1 +
               stars_int_6 + stars_int_5 + stars_int_2
  test =~ stars_test_3 + stars_test_6 + stars_test_4 +
          stars_test_1 + stars_test_8
  help =~ stars_help_2 + stars_help_3 + stars_help_4
'

Split the sample and fit the CFA models:
There are various ways to do this, but I’ve opted to show you an efficient and reproducible solution that uses more advanced code than what you’ll have encountered on the course so far. Probably.
This code fits the same CFA model separately within each group by filtering the dataset, estimating the model independently, and storing the results in a named list for comparison.
item_vars <- c(                                      # <1>
  "stars_int_3", "stars_int_11", "stars_int_1",
  "stars_int_6", "stars_int_5", "stars_int_2",
  "stars_test_3", "stars_test_6", "stars_test_4",
  "stars_test_1", "stars_test_8",
  "stars_help_2", "stars_help_3", "stars_help_4"
)

groups <- unique(stars_data$language)                # <2>

fits <- purrr::map(groups, ~ {                       # <3>
  stars_grouped <- stars_data |>
    dplyr::filter(language == .x) |>                 # <4>
    dplyr::select(dplyr::all_of(item_vars))          # <5>

  lavaan::cfa(                                       # <6>
    model = stars_model,
    data = stars_grouped,
    estimator = "WLSMV",
    ordered = item_vars
  )
}) |>
  rlang::set_names(groups)                           # <7>

1. Defines a character vector of item variable names to be used in the CFA.
2. Extracts the unique group labels (i.e., "English", "Non-English"), which we will use to loop over each group separately.
3. Applies a function that loops over each group in groups — .x represents the current group value (e.g., "English"); map() will return a list of fitted models (one per group).
4. Filters the dataset to only include one group at a time (e.g., only English participants).
5. Keeps only the scale items used in the CFA (as listed in item_vars) to prevent grouping variables or other columns from entering the model.
6. Fits our CFA model (as per CFA week).
7. Assigns names to each model in the list to make the output easier to interpret (e.g., "English", "Non-English" instead of [[1]], [[2]]).
Compare model fit for each group:
This code extracts key fit statistics from each group-specific CFA model and combines them into a single table, making it easy to compare how well the model fits in each group.
fit_summary <- fits |>
  purrr::map_dfr(                                    # <1>
    ~ lavaan::fitMeasures(.x,                        # <2>
        c("cfi", "tli", "rmsea", "srmr")
      ),
    .id = "group"                                    # <3>
  )

fit_summary

1. Applies a function to each model in the list and returns the results as a data frame (instead of a list); _dfr = "map + bind rows".
2. Extracts the specified fit indices for each group.
3. Adds a column called "group" using the names of the fits list (e.g., "English", "Non-English") to allow you to identify which row corresponds to which group.
# A tibble: 2 × 5
group cfi tli rmsea srmr
<chr> <lvn.vctr> <lvn.vctr> <lvn.vctr> <lvn.vctr>
1 Non-English 0.9925286 0.9908122 0.07258030 0.05216827
2 English 0.9956694 0.9946746 0.05801045 0.03935945
Is model fit acceptable within each group?
✔️ The model (hypothesised factor structure) fits well in both groups
However…
Non-English fits slightly better
English fits slightly worse
But the differences are small:
ΔCFI ≈ .003 → trivial
ΔRMSEA ≈ .015 → noticeable but not dramatic
This could be due to:
Sampling variation (e.g., one group might be more heterogeneous, noisier, or a little less consistent in responses)
Model misspecification (e.g., one item behaving differently in one group, a residual correlation that exists in one group only, or a weaker loading)
Non-invariance (but this is a flag, not a diagnosis — we need model constraints for that)
Compare factor loadings for each group:
This code extracts the factor loadings from each group-specific CFA model and combines them into a single table, allowing you to assess the strength and consistency of item–factor relationships across groups.
loadings <- fits |>
  purrr::map_dfr(                                            # <1>
    ~ lavaan::parameterEstimates(.x, standardized = TRUE) |> # <2>
      dplyr::filter(op == "=~") |>                           # <3>
      dplyr::select(lhs, rhs, est, std.all),                 # <4>
    .id = "group"                                            # <5>
  )

loadings

1. Applies a function to each model and combines the results into a single data frame; _dfr = map + bind rows.
2. Extracts all parameter estimates from each model — standardized = TRUE adds standardised estimates (std.all; these are easier to interpret than raw (unstandardised) values).
3. Keeps only factor loadings (=~ indicates relationships between latent variables and their indicators).
4. Keeps only relevant columns — lhs: latent factor, rhs: observed item, est: unstandardised loading, std.all: standardised loading.
5. Adds a column identifying the group for each row using the names from the fits list.
group lhs rhs est std.all
1 Non-English interpret stars_int_3 1.000 0.837
2 Non-English interpret stars_int_11 1.043 0.873
3 Non-English interpret stars_int_1 1.019 0.853
4 Non-English interpret stars_int_6 0.964 0.807
5 Non-English interpret stars_int_5 0.752 0.630
6 Non-English interpret stars_int_2 0.899 0.752
7 Non-English test stars_test_3 1.000 0.895
8 Non-English test stars_test_6 0.912 0.816
9 Non-English test stars_test_4 0.919 0.823
10 Non-English test stars_test_1 0.949 0.849
11 Non-English test stars_test_8 0.778 0.696
12 Non-English help stars_help_2 1.000 0.833
13 Non-English help stars_help_3 1.053 0.877
14 Non-English help stars_help_4 0.993 0.827
15 English interpret stars_int_3 1.000 0.838
16 English interpret stars_int_11 1.015 0.851
17 English interpret stars_int_1 1.011 0.847
18 English interpret stars_int_6 0.949 0.795
19 English interpret stars_int_5 0.711 0.596
20 English interpret stars_int_2 0.953 0.799
21 English test stars_test_3 1.000 0.878
22 English test stars_test_6 0.956 0.840
23 English test stars_test_4 0.952 0.836
24 English test stars_test_1 0.969 0.851
25 English test stars_test_8 0.766 0.673
26 English help stars_help_2 1.000 0.849
27 English help stars_help_3 1.067 0.906
28 English help stars_help_4 1.017 0.863
Are factor loadings reasonably strong and in the expected direction within each group?
✔️ Yes - quite clearly
Across both groups:
Most loadings are .75–.90 → strong
A few are around .60–.70 → very acceptable but weaker
Meaning:
Items are clearly related to their factors
Each factor is well-defined by its indicators
Check for estimation issues in each group:
This code checks for common estimation problems (e.g., negative variances, unstable estimates) in each group-specific model, helping ensure that the results can be trusted before proceeding to invariance testing.
If the resulting table isn’t empty, something’s off — don’t just ignore it and carry on.
issues <- fits |>                                    # <1>
  purrr::map_dfr(                                    # <2>
    ~ lavaan::parameterEstimates(.x) |>              # <3>
      dplyr::filter(                                 # <4>
        est < 0 & op == "~~" |                       # <5>
        se > 10 |                                    # <6>
        abs(est) > 10                                # <7>
      ),
    .id = "group"                                    # <8>
  )

issues

1. Starts with the list of fitted CFA models (fits); each model corresponds to a different group.
2. Applies a function to each model and combines results into a single data frame; _dfr = map + bind rows.
3. Extracts all parameter estimates from each model (includes loadings, variances, covariances, thresholds, etc.).
4. Filters the output.
5. Flags negative variances (Heywood cases); ~~ refers to variances/covariances.
6. Flags very large standard errors.
7. Flags implausibly large parameter estimates.
8. Adds a column identifying which group the issue comes from (helps pinpoint group-specific estimation problems).
[1] group lhs op rhs est se z pvalue
[9] ci.lower ci.upper
<0 rows> (or 0-length row.names)
✔️ There are no convergence issues, warnings, or Heywood cases
Step 1: Configural Model
The key question at this stage is whether the same measurement model can be meaningfully applied across groups.
Although we have already fitted the model separately in each group, this does not test measurement invariance. Separate CFAs simply tell us that the model fits reasonably well within each group in isolation. They do not tell us whether the model is operating in a comparable way across groups.
Configural invariance is the first step that brings the groups together into a single multi-group model. Here, we specify the same factor structure in each group, but allow all parameters (loadings, intercepts/thresholds, residuals) to be freely estimated. In other words, the form of the model is held constant, but the values are allowed to differ.
This allows us to test a more meaningful question:
Can the same conceptual model explain the data across groups when estimated simultaneously?
This step is necessary because good fit in separate groups can be misleading. A model might fit each group individually, but still reflect different underlying patterns of relationships. By estimating the model jointly, the configural model evaluates whether a common structure is plausible across groups.
In practical terms, we assess:
Overall model fit in the multi-group model (global fit)
Whether the same pattern of factor–item relationships is supported across groups (local fit)
If the configural model fits well, this suggests that participants across groups conceptualise the construct in a broadly similar way, providing a foundation for more restrictive tests of invariance.
If it does not fit well, this indicates that the basic structure is not comparable across groups, and invariance testing should not proceed.
We can test this by:
Converting our grouping variable to a factor (to ensure lavaan handles it correctly)
Specifying our measurement model (we already did that in the last step)
Specifying a vector of our ordinal item variables (we already did that in the last step too)
Fitting the configural model
Calling a summary of the output
Convert our grouping variable to a factor:
This ensures lavaan treats it as a grouping variable rather than numeric or character (required for multi-group CFA):
stars_data <- stars_data |>
  dplyr::mutate(language = forcats::as_factor(language))
Fit the configural invariance model:
Then fit a configural invariance model, testing whether the same factor structure holds across groups without imposing any equality constraints (yet).
fit_configural <- lavaan::cfa(
  model = stars_model,
  data = stars_data,
  group = "language",                                # <1>
  estimator = "WLSMV",
  ordered = item_vars
)

lavaan::summary(fit_configural, fit.measures = TRUE, standardized = TRUE)

1. Tells lavaan to fit a multi-group model based on the language variable (all the other code is the same as our previous CFAs).
lavaan 0.6-19 ended normally after 42 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 146
Number of observations per group: Used Total
Non-English 3227 3236
English 3641 3649
Model Test User Model:
Standard Scaled
Test Statistic 2312.028 3840.784
Degrees of freedom 148 148
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.608
Shift parameter 35.883
simple second-order correction
Test statistic for each group:
Non-English 2208.230 2208.230
English 1632.554 1632.554
Model Test Baseline Model:
Test statistic 377816.121 138627.992
Degrees of freedom 182 182
P-value 0.000 0.000
Scaling correction factor 2.728
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.994 0.973
Tucker-Lewis Index (TLI) 0.993 0.967
Robust Comparative Fit Index (CFI) 0.952
Robust Tucker-Lewis Index (TLI) 0.941
Root Mean Square Error of Approximation:
RMSEA 0.065 0.085
90 Percent confidence interval - lower 0.063 0.083
90 Percent confidence interval - upper 0.068 0.088
P-value H_0: RMSEA <= 0.050 0.000 0.000
P-value H_0: RMSEA >= 0.080 0.000 1.000
Robust RMSEA 0.080
90 Percent confidence interval - lower 0.077
90 Percent confidence interval - upper 0.083
P-value H_0: Robust RMSEA <= 0.050 0.000
P-value H_0: Robust RMSEA >= 0.080 0.572
Standardized Root Mean Square Residual:
SRMR 0.045 0.045
Parameter Estimates:
Parameterization Delta
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Group 1 [Non-English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
stars_int_3 1.000 0.837 0.837
stars_int_11 1.043 0.010 104.649 0.000 0.873 0.873
stars_int_1 1.019 0.010 106.379 0.000 0.853 0.853
stars_int_6 0.964 0.011 86.770 0.000 0.807 0.807
stars_int_5 0.752 0.016 45.719 0.000 0.630 0.630
stars_int_2 0.899 0.012 73.586 0.000 0.752 0.752
test =~
stars_test_3 1.000 0.895 0.895
stars_test_6 0.912 0.010 93.985 0.000 0.816 0.816
stars_test_4 0.919 0.010 92.200 0.000 0.823 0.823
stars_test_1 0.949 0.010 98.645 0.000 0.849 0.849
stars_test_8 0.778 0.014 56.250 0.000 0.696 0.696
help =~
stars_help_2 1.000 0.833 0.833
stars_help_3 1.053 0.014 72.986 0.000 0.877 0.877
stars_help_4 0.993 0.014 71.738 0.000 0.827 0.827
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.488 0.011 43.236 0.000 0.651 0.651
help 0.443 0.012 37.956 0.000 0.635 0.635
test ~~
help 0.445 0.012 36.300 0.000 0.597 0.597
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3|t1 -0.368 0.023 -16.278 0.000 -0.368 -0.368
stars_int_3|t2 0.320 0.022 14.250 0.000 0.320 0.320
stars_int_3|t3 0.966 0.026 36.803 0.000 0.966 0.966
stars_int_3|t4 1.730 0.039 43.854 0.000 1.730 1.730
stars_nt_11|t1 -0.708 0.024 -29.251 0.000 -0.708 -0.708
stars_nt_11|t2 0.038 0.022 1.742 0.081 0.038 0.038
stars_nt_11|t3 0.703 0.024 29.083 0.000 0.703 0.703
stars_nt_11|t4 1.476 0.033 44.110 0.000 1.476 1.476
stars_int_1|t1 -0.707 0.024 -29.217 0.000 -0.707 -0.707
stars_int_1|t2 0.010 0.022 0.440 0.660 0.010 0.010
stars_int_1|t3 0.725 0.024 29.820 0.000 0.725 0.725
stars_int_1|t4 1.478 0.033 44.119 0.000 1.478 1.478
stars_int_6|t1 -0.770 0.025 -31.282 0.000 -0.770 -0.770
stars_int_6|t2 0.004 0.022 0.194 0.846 0.004 0.004
stars_int_6|t3 0.716 0.024 29.519 0.000 0.716 0.716
stars_int_6|t4 1.528 0.035 44.261 0.000 1.528 1.528
stars_int_5|t1 0.240 0.022 10.779 0.000 0.240 0.240
stars_int_5|t2 0.878 0.025 34.496 0.000 0.878 0.878
stars_int_5|t3 1.410 0.032 43.763 0.000 1.410 1.410
stars_int_5|t4 1.938 0.046 41.957 0.000 1.938 1.938
stars_int_2|t1 -0.815 0.025 -32.660 0.000 -0.815 -0.815
stars_int_2|t2 -0.027 0.022 -1.214 0.225 -0.027 -0.027
stars_int_2|t3 0.773 0.025 31.381 0.000 0.773 0.773
stars_int_2|t4 1.611 0.036 44.279 0.000 1.611 1.611
stars_tst_3|t1 -1.858 0.043 -42.848 0.000 -1.858 -1.858
stars_tst_3|t2 -1.127 0.028 -40.257 0.000 -1.127 -1.127
stars_tst_3|t3 -0.521 0.023 -22.464 0.000 -0.521 -0.521
stars_tst_3|t4 0.219 0.022 9.832 0.000 0.219 0.219
stars_tst_6|t1 -1.480 0.034 -44.127 0.000 -1.480 -1.480
stars_tst_6|t2 -0.740 0.024 -30.320 0.000 -0.740 -0.740
stars_tst_6|t3 -0.182 0.022 -8.215 0.000 -0.182 -0.182
stars_tst_6|t4 0.497 0.023 21.531 0.000 0.497 0.497
stars_tst_4|t1 -1.487 0.034 -44.153 0.000 -1.487 -1.487
stars_tst_4|t2 -0.806 0.025 -32.399 0.000 -0.806 -0.806
stars_tst_4|t3 -0.246 0.022 -11.025 0.000 -0.246 -0.246
stars_tst_4|t4 0.499 0.023 21.600 0.000 0.499 0.499
stars_tst_1|t1 -1.193 0.029 -41.362 0.000 -1.193 -1.193
stars_tst_1|t2 -0.469 0.023 -20.422 0.000 -0.469 -0.469
stars_tst_1|t3 0.209 0.022 9.375 0.000 0.209 0.209
stars_tst_1|t4 1.008 0.027 37.795 0.000 1.008 1.008
stars_tst_8|t1 -0.785 0.025 -31.743 0.000 -0.785 -0.785
stars_tst_8|t2 -0.178 0.022 -8.040 0.000 -0.178 -0.178
stars_tst_8|t3 0.359 0.023 15.894 0.000 0.359 0.359
stars_tst_8|t4 1.027 0.027 38.237 0.000 1.027 1.027
stars_hlp_2|t1 -0.762 0.025 -31.017 0.000 -0.762 -0.762
stars_hlp_2|t2 -0.024 0.022 -1.109 0.267 -0.024 -0.024
stars_hlp_2|t3 0.566 0.023 24.186 0.000 0.566 0.566
stars_hlp_2|t4 1.278 0.030 42.533 0.000 1.278 1.278
stars_hlp_3|t1 -0.580 0.023 -24.701 0.000 -0.580 -0.580
stars_hlp_3|t2 0.168 0.022 7.583 0.000 0.168 0.168
stars_hlp_3|t3 0.794 0.025 32.039 0.000 0.794 0.794
stars_hlp_3|t4 1.575 0.036 44.303 0.000 1.575 1.575
stars_hlp_4|t1 -0.237 0.022 -10.639 0.000 -0.237 -0.237
stars_hlp_4|t2 0.572 0.023 24.427 0.000 0.572 0.572
stars_hlp_4|t3 1.210 0.029 41.629 0.000 1.210 1.210
stars_hlp_4|t4 1.927 0.046 42.080 0.000 1.927 1.927
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.299 0.299 0.299
.stars_int_11 0.237 0.237 0.237
.stars_int_1 0.273 0.273 0.273
.stars_int_6 0.349 0.349 0.349
.stars_int_5 0.603 0.603 0.603
.stars_int_2 0.434 0.434 0.434
.stars_test_3 0.199 0.199 0.199
.stars_test_6 0.334 0.334 0.334
.stars_test_4 0.323 0.323 0.323
.stars_test_1 0.280 0.280 0.280
.stars_test_8 0.515 0.515 0.515
.stars_help_2 0.305 0.305 0.305
.stars_help_3 0.230 0.230 0.230
.stars_help_4 0.315 0.315 0.315
interpret 0.701 0.011 61.703 0.000 1.000 1.000
test 0.801 0.011 74.073 0.000 1.000 1.000
help 0.695 0.013 51.611 0.000 1.000 1.000
Group 2 [English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
stars_int_3 1.000 0.838 0.838
stars_int_11 1.015 0.009 111.755 0.000 0.851 0.851
stars_int_1 1.011 0.009 111.297 0.000 0.847 0.847
stars_int_6 0.949 0.010 92.954 0.000 0.795 0.795
stars_int_5 0.711 0.015 46.829 0.000 0.596 0.596
stars_int_2 0.953 0.010 92.724 0.000 0.799 0.799
test =~
stars_test_3 1.000 0.878 0.878
stars_test_6 0.956 0.010 99.683 0.000 0.840 0.840
stars_test_4 0.952 0.009 101.263 0.000 0.836 0.836
stars_test_1 0.969 0.009 104.149 0.000 0.851 0.851
stars_test_8 0.766 0.013 60.328 0.000 0.673 0.673
help =~
stars_help_2 1.000 0.849 0.849
stars_help_3 1.067 0.011 99.214 0.000 0.906 0.906
stars_help_4 1.017 0.011 96.624 0.000 0.863 0.863
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.487 0.011 46.356 0.000 0.662 0.662
help 0.439 0.011 41.307 0.000 0.617 0.617
test ~~
help 0.450 0.011 39.778 0.000 0.604 0.604
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3|t1 -0.756 0.023 -32.740 0.000 -0.756 -0.756
stars_int_3|t2 0.071 0.021 3.397 0.001 0.071 0.071
stars_int_3|t3 0.802 0.023 34.291 0.000 0.802 0.802
stars_int_3|t4 1.535 0.033 47.027 0.000 1.535 1.535
stars_nt_11|t1 -1.063 0.026 -41.430 0.000 -1.063 -1.063
stars_nt_11|t2 -0.238 0.021 -11.340 0.000 -0.238 -0.238
stars_nt_11|t3 0.434 0.022 20.197 0.000 0.434 0.434
stars_nt_11|t4 1.171 0.027 43.565 0.000 1.171 1.171
stars_int_1|t1 -0.986 0.025 -39.609 0.000 -0.986 -0.986
stars_int_1|t2 -0.148 0.021 -7.073 0.000 -0.148 -0.148
stars_int_1|t3 0.529 0.022 24.209 0.000 0.529 0.529
stars_int_1|t4 1.271 0.028 45.093 0.000 1.271 1.271
stars_int_6|t1 -0.972 0.025 -39.240 0.000 -0.972 -0.972
stars_int_6|t2 -0.136 0.021 -6.543 0.000 -0.136 -0.136
stars_int_6|t3 0.545 0.022 24.858 0.000 0.545 0.545
stars_int_6|t4 1.279 0.028 45.193 0.000 1.279 1.279
stars_int_5|t1 -0.171 0.021 -8.165 0.000 -0.171 -0.171
stars_int_5|t2 0.493 0.022 22.680 0.000 0.493 0.493
stars_int_5|t3 1.094 0.026 42.090 0.000 1.094 1.094
stars_int_5|t4 1.719 0.037 46.648 0.000 1.719 1.719
stars_int_2|t1 -1.053 0.026 -41.215 0.000 -1.053 -1.053
stars_int_2|t2 -0.203 0.021 -9.720 0.000 -0.203 -0.203
stars_int_2|t3 0.483 0.022 22.289 0.000 0.483 0.483
stars_int_2|t4 1.320 0.029 45.686 0.000 1.320 1.320
stars_tst_3|t1 -2.070 0.049 -42.583 0.000 -2.070 -2.070
stars_tst_3|t2 -1.286 0.028 -45.292 0.000 -1.286 -1.286
stars_tst_3|t3 -0.617 0.022 -27.698 0.000 -0.617 -0.617
stars_tst_3|t4 0.191 0.021 9.125 0.000 0.191 0.191
stars_tst_6|t1 -1.817 0.040 -45.915 0.000 -1.817 -1.817
stars_tst_6|t2 -1.043 0.025 -40.971 0.000 -1.043 -1.043
stars_tst_6|t3 -0.430 0.021 -20.000 0.000 -0.430 -0.430
stars_tst_6|t4 0.327 0.021 15.432 0.000 0.327 0.327
stars_tst_4|t1 -1.817 0.040 -45.915 0.000 -1.817 -1.817
stars_tst_4|t2 -1.105 0.026 -42.323 0.000 -1.105 -1.105
stars_tst_4|t3 -0.521 0.022 -23.852 0.000 -0.521 -0.521
stars_tst_4|t4 0.238 0.021 11.340 0.000 0.238 0.238
stars_tst_1|t1 -1.513 0.032 -46.981 0.000 -1.513 -1.513
stars_tst_1|t2 -0.696 0.023 -30.633 0.000 -0.696 -0.696
stars_tst_1|t3 -0.009 0.021 -0.414 0.679 -0.009 -0.009
stars_tst_1|t4 0.789 0.023 33.859 0.000 0.789 0.789
stars_tst_8|t1 -1.210 0.027 -44.215 0.000 -1.210 -1.210
stars_tst_8|t2 -0.465 0.022 -21.505 0.000 -0.465 -0.465
stars_tst_8|t3 0.149 0.021 7.139 0.000 0.149 0.149
stars_tst_8|t4 0.882 0.024 36.751 0.000 0.882 0.882
stars_hlp_2|t1 -0.949 0.025 -38.636 0.000 -0.949 -0.949
stars_hlp_2|t2 -0.195 0.021 -9.323 0.000 -0.195 -0.195
stars_hlp_2|t3 0.365 0.021 17.144 0.000 0.365 0.365
stars_hlp_2|t4 1.069 0.026 41.563 0.000 1.069 1.069
stars_hlp_3|t1 -0.981 0.025 -39.468 0.000 -0.981 -0.981
stars_hlp_3|t2 -0.184 0.021 -8.794 0.000 -0.184 -0.184
stars_hlp_3|t3 0.409 0.021 19.115 0.000 0.409 0.409
stars_hlp_3|t4 1.144 0.027 43.079 0.000 1.144 1.144
stars_hlp_4|t1 -0.773 0.023 -33.332 0.000 -0.773 -0.773
stars_hlp_4|t2 0.075 0.021 3.629 0.000 0.075 0.075
stars_hlp_4|t3 0.692 0.023 30.506 0.000 0.692 0.692
stars_hlp_4|t4 1.407 0.030 46.470 0.000 1.407 1.407
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.298 0.298 0.298
.stars_int_11 0.277 0.277 0.277
.stars_int_1 0.282 0.282 0.282
.stars_int_6 0.367 0.367 0.367
.stars_int_5 0.645 0.645 0.645
.stars_int_2 0.362 0.362 0.362
.stars_test_3 0.229 0.229 0.229
.stars_test_6 0.295 0.295 0.295
.stars_test_4 0.302 0.302 0.302
.stars_test_1 0.275 0.275 0.275
.stars_test_8 0.548 0.548 0.548
.stars_help_2 0.280 0.280 0.280
.stars_help_3 0.179 0.179 0.179
.stars_help_4 0.255 0.255 0.255
interpret 0.702 0.010 69.268 0.000 1.000 1.000
test 0.771 0.010 73.893 0.000 1.000 1.000
help 0.720 0.011 63.576 0.000 1.000 1.000
🤔 Can the same measurement model be meaningfully applied across groups?
✔️ The configural model shows acceptable fit across groups, with strong and consistent factor loadings and no estimation problems. This suggests that the same factor structure is plausible in both groups, providing a suitable baseline for testing measurement invariance.
Specifically:
- The model ran properly (“lavaan ended normally after 42 iterations”)
- Overall model fit was acceptable (CFI ≈ .95; TLI ≈ .94; RMSEA ≈ .08; SRMR ≈ .045)
- Group specific χ² showed different levels of misfit across groups (Non-English: χ² ≈ 2208; English: χ² ≈ 1633), but nothing dramatic or alarming
- Factor loadings are mostly strong (.75–.90) and all positive, so items are good indicators of their factors in both groups and the measurement structure is consistent (if they weren’t, we would consider removing problematic items)
- Factor correlations are moderate to strong (~.60–.65), so are related but distinct
- Thresholds were successfully estimated (you’re not interpreting them yet — just confirming the model behaves)
- Residual variances are all positive (no Heywood cases)
The same factor structure can be estimated across groups without imposing equality constraints. This supports configural invariance, indicating that the construct has a similar conceptual meaning in both groups.
At this stage, we can conclude that the overall pattern of relationships between items and factors is consistent across groups. However, comparisons of relationships or means are not yet meaningful, as the scale of the construct may still differ between groups.
Step 2: Metric Invariance
The key question at this stage is whether the relationships between items and their underlying factors are equivalent across groups.
Metric invariance tests whether factor loadings can be constrained to be equal across groups, meaning that each item contributes to the latent construct to the same extent in each group. This is important because it determines whether the construct has the same meaning across groups.
To evaluate this, we fit a multi-group CFA model in which factor loadings are constrained to be equal across groups, while other parameters remain freely estimated. We then compare this model to the configural model using changes in fit indices.
If model fit does not deteriorate substantially, this supports metric invariance, indicating that the items relate to the latent variables in a comparable way across groups. However, if model fit worsens notably, this suggests that some items function differently across groups, and full metric invariance may not hold. In such cases, it may be necessary to identify and relax constraints on specific loadings.
fit_metric <- lavaan::cfa(
  model = stars_model,
  data = stars_data,
  group = "language",
  group.equal = "loadings",                          # <1>
  estimator = "WLSMV",
  ordered = item_vars
)

lavaan::summary(fit_metric, fit.measures = TRUE, standardized = TRUE)

1. This argument constrains the loadings to be equal. Otherwise our code is as per the configural model.
lavaan 0.6-19 ended normally after 42 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 146
Number of equality constraints 11
Number of observations per group: Used Total
Non-English 3227 3236
English 3641 3649
Model Test User Model:
Standard Scaled
Test Statistic 2396.186 3102.296
Degrees of freedom 159 159
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.785
Shift parameter 49.051
simple second-order correction
Test statistic for each group:
Non-English 1779.259 1779.259
English 1323.037 1323.037
Model Test Baseline Model:
Test statistic 377816.121 138627.992
Degrees of freedom 182 182
P-value 0.000 0.000
Scaling correction factor 2.728
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.994 0.979
Tucker-Lewis Index (TLI) 0.993 0.976
Robust Comparative Fit Index (CFI) 0.951
Robust Tucker-Lewis Index (TLI) 0.944
Root Mean Square Error of Approximation:
RMSEA 0.064 0.073
90 Percent confidence interval - lower 0.062 0.071
90 Percent confidence interval - upper 0.066 0.076
P-value H_0: RMSEA <= 0.050 0.000 0.000
P-value H_0: RMSEA >= 0.080 0.000 0.000
Robust RMSEA 0.078
90 Percent confidence interval - lower 0.075
90 Percent confidence interval - upper 0.081
P-value H_0: Robust RMSEA <= 0.050 0.000
P-value H_0: Robust RMSEA >= 0.080 0.121
Standardized Root Mean Square Residual:
SRMR 0.046 0.046
Parameter Estimates:
Parameterization Delta
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Group 1 [Non-English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.840 0.840
str__11 (.p2.) 1.028 0.007 153.095 0.000 0.863 0.863
strs__1 (.p3.) 1.015 0.007 153.912 0.000 0.852 0.852
strs__6 (.p4.) 0.956 0.008 127.113 0.000 0.803 0.803
strs__5 (.p5.) 0.729 0.011 65.270 0.000 0.612 0.612
strs__2 (.p6.) 0.929 0.008 117.972 0.000 0.780 0.780
test =~
strs__3 1.000 0.884 0.884
strs__6 (.p8.) 0.936 0.007 137.147 0.000 0.828 0.828
strs__4 (.p9.) 0.937 0.007 136.902 0.000 0.829 0.829
strs__1 (.10.) 0.960 0.007 143.569 0.000 0.849 0.849
strs__8 (.11.) 0.771 0.009 82.455 0.000 0.682 0.682
help =~
strs__2 1.000 0.827 0.827
strs__3 (.13.) 1.062 0.009 122.765 0.000 0.878 0.878
strs__4 (.14.) 1.007 0.008 120.123 0.000 0.833 0.833
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.484 0.011 43.576 0.000 0.651 0.651
help 0.441 0.011 39.553 0.000 0.635 0.635
test ~~
help 0.437 0.012 37.161 0.000 0.597 0.597
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3|t1 -0.368 0.023 -16.278 0.000 -0.368 -0.368
stars_int_3|t2 0.320 0.022 14.250 0.000 0.320 0.320
stars_int_3|t3 0.966 0.026 36.803 0.000 0.966 0.966
stars_int_3|t4 1.730 0.039 43.854 0.000 1.730 1.730
stars_nt_11|t1 -0.708 0.024 -29.251 0.000 -0.708 -0.708
stars_nt_11|t2 0.038 0.022 1.742 0.081 0.038 0.038
stars_nt_11|t3 0.703 0.024 29.083 0.000 0.703 0.703
stars_nt_11|t4 1.476 0.033 44.110 0.000 1.476 1.476
stars_int_1|t1 -0.707 0.024 -29.217 0.000 -0.707 -0.707
stars_int_1|t2 0.010 0.022 0.440 0.660 0.010 0.010
stars_int_1|t3 0.725 0.024 29.820 0.000 0.725 0.725
stars_int_1|t4 1.478 0.033 44.119 0.000 1.478 1.478
stars_int_6|t1 -0.770 0.025 -31.282 0.000 -0.770 -0.770
stars_int_6|t2 0.004 0.022 0.194 0.846 0.004 0.004
stars_int_6|t3 0.716 0.024 29.519 0.000 0.716 0.716
stars_int_6|t4 1.528 0.035 44.261 0.000 1.528 1.528
stars_int_5|t1 0.240 0.022 10.779 0.000 0.240 0.240
stars_int_5|t2 0.878 0.025 34.496 0.000 0.878 0.878
stars_int_5|t3 1.410 0.032 43.763 0.000 1.410 1.410
stars_int_5|t4 1.938 0.046 41.957 0.000 1.938 1.938
stars_int_2|t1 -0.815 0.025 -32.660 0.000 -0.815 -0.815
stars_int_2|t2 -0.027 0.022 -1.214 0.225 -0.027 -0.027
stars_int_2|t3 0.773 0.025 31.381 0.000 0.773 0.773
stars_int_2|t4 1.611 0.036 44.279 0.000 1.611 1.611
stars_tst_3|t1 -1.858 0.043 -42.848 0.000 -1.858 -1.858
stars_tst_3|t2 -1.127 0.028 -40.257 0.000 -1.127 -1.127
stars_tst_3|t3 -0.521 0.023 -22.464 0.000 -0.521 -0.521
stars_tst_3|t4 0.219 0.022 9.832 0.000 0.219 0.219
stars_tst_6|t1 -1.480 0.034 -44.127 0.000 -1.480 -1.480
stars_tst_6|t2 -0.740 0.024 -30.320 0.000 -0.740 -0.740
stars_tst_6|t3 -0.182 0.022 -8.215 0.000 -0.182 -0.182
stars_tst_6|t4 0.497 0.023 21.531 0.000 0.497 0.497
stars_tst_4|t1 -1.487 0.034 -44.153 0.000 -1.487 -1.487
stars_tst_4|t2 -0.806 0.025 -32.399 0.000 -0.806 -0.806
stars_tst_4|t3 -0.246 0.022 -11.025 0.000 -0.246 -0.246
stars_tst_4|t4 0.499 0.023 21.600 0.000 0.499 0.499
stars_tst_1|t1 -1.193 0.029 -41.362 0.000 -1.193 -1.193
stars_tst_1|t2 -0.469 0.023 -20.422 0.000 -0.469 -0.469
stars_tst_1|t3 0.209 0.022 9.375 0.000 0.209 0.209
stars_tst_1|t4 1.008 0.027 37.795 0.000 1.008 1.008
stars_tst_8|t1 -0.785 0.025 -31.743 0.000 -0.785 -0.785
stars_tst_8|t2 -0.178 0.022 -8.040 0.000 -0.178 -0.178
stars_tst_8|t3 0.359 0.023 15.894 0.000 0.359 0.359
stars_tst_8|t4 1.027 0.027 38.237 0.000 1.027 1.027
stars_hlp_2|t1 -0.762 0.025 -31.017 0.000 -0.762 -0.762
stars_hlp_2|t2 -0.024 0.022 -1.109 0.267 -0.024 -0.024
stars_hlp_2|t3 0.566 0.023 24.186 0.000 0.566 0.566
stars_hlp_2|t4 1.278 0.030 42.533 0.000 1.278 1.278
stars_hlp_3|t1 -0.580 0.023 -24.701 0.000 -0.580 -0.580
stars_hlp_3|t2 0.168 0.022 7.583 0.000 0.168 0.168
stars_hlp_3|t3 0.794 0.025 32.039 0.000 0.794 0.794
stars_hlp_3|t4 1.575 0.036 44.303 0.000 1.575 1.575
stars_hlp_4|t1 -0.237 0.022 -10.639 0.000 -0.237 -0.237
stars_hlp_4|t2 0.572 0.023 24.427 0.000 0.572 0.572
stars_hlp_4|t3 1.210 0.029 41.629 0.000 1.210 1.210
stars_hlp_4|t4 1.927 0.046 42.080 0.000 1.927 1.927
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.295 0.295 0.295
.stars_int_11 0.255 0.255 0.255
.stars_int_1 0.274 0.274 0.274
.stars_int_6 0.356 0.356 0.356
.stars_int_5 0.625 0.625 0.625
.stars_int_2 0.391 0.391 0.391
.stars_test_3 0.218 0.218 0.218
.stars_test_6 0.315 0.315 0.315
.stars_test_4 0.313 0.313 0.313
.stars_test_1 0.279 0.279 0.279
.stars_test_8 0.534 0.534 0.534
.stars_help_2 0.316 0.316 0.316
.stars_help_3 0.229 0.229 0.229
.stars_help_4 0.306 0.306 0.306
interpret 0.705 0.009 75.730 0.000 1.000 1.000
test 0.782 0.009 86.273 0.000 1.000 1.000
help 0.684 0.010 69.690 0.000 1.000 1.000
Group 2 [English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.836 0.836
str__11 (.p2.) 1.028 0.007 153.095 0.000 0.859 0.859
strs__1 (.p3.) 1.015 0.007 153.912 0.000 0.848 0.848
strs__6 (.p4.) 0.956 0.008 127.113 0.000 0.799 0.799
strs__5 (.p5.) 0.729 0.011 65.270 0.000 0.610 0.610
strs__2 (.p6.) 0.929 0.008 117.972 0.000 0.777 0.777
test =~
strs__3 1.000 0.887 0.887
strs__6 (.p8.) 0.936 0.007 137.147 0.000 0.830 0.830
strs__4 (.p9.) 0.937 0.007 136.902 0.000 0.831 0.831
strs__1 (.10.) 0.960 0.007 143.569 0.000 0.851 0.851
strs__8 (.11.) 0.771 0.009 82.455 0.000 0.684 0.684
help =~
strs__2 1.000 0.853 0.853
strs__3 (.13.) 1.062 0.009 122.765 0.000 0.906 0.906
strs__4 (.14.) 1.007 0.008 120.123 0.000 0.859 0.859
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.491 0.010 46.894 0.000 0.662 0.662
help 0.440 0.010 42.466 0.000 0.617 0.617
test ~~
help 0.457 0.011 40.610 0.000 0.604 0.604
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3|t1 -0.756 0.023 -32.740 0.000 -0.756 -0.756
stars_int_3|t2 0.071 0.021 3.397 0.001 0.071 0.071
stars_int_3|t3 0.802 0.023 34.291 0.000 0.802 0.802
stars_int_3|t4 1.535 0.033 47.027 0.000 1.535 1.535
stars_nt_11|t1 -1.063 0.026 -41.430 0.000 -1.063 -1.063
stars_nt_11|t2 -0.238 0.021 -11.340 0.000 -0.238 -0.238
stars_nt_11|t3 0.434 0.022 20.197 0.000 0.434 0.434
stars_nt_11|t4 1.171 0.027 43.565 0.000 1.171 1.171
stars_int_1|t1 -0.986 0.025 -39.609 0.000 -0.986 -0.986
stars_int_1|t2 -0.148 0.021 -7.073 0.000 -0.148 -0.148
stars_int_1|t3 0.529 0.022 24.209 0.000 0.529 0.529
stars_int_1|t4 1.271 0.028 45.093 0.000 1.271 1.271
stars_int_6|t1 -0.972 0.025 -39.240 0.000 -0.972 -0.972
stars_int_6|t2 -0.136 0.021 -6.543 0.000 -0.136 -0.136
stars_int_6|t3 0.545 0.022 24.858 0.000 0.545 0.545
stars_int_6|t4 1.279 0.028 45.193 0.000 1.279 1.279
stars_int_5|t1 -0.171 0.021 -8.165 0.000 -0.171 -0.171
stars_int_5|t2 0.493 0.022 22.680 0.000 0.493 0.493
stars_int_5|t3 1.094 0.026 42.090 0.000 1.094 1.094
stars_int_5|t4 1.719 0.037 46.648 0.000 1.719 1.719
stars_int_2|t1 -1.053 0.026 -41.215 0.000 -1.053 -1.053
stars_int_2|t2 -0.203 0.021 -9.720 0.000 -0.203 -0.203
stars_int_2|t3 0.483 0.022 22.289 0.000 0.483 0.483
stars_int_2|t4 1.320 0.029 45.686 0.000 1.320 1.320
stars_tst_3|t1 -2.070 0.049 -42.583 0.000 -2.070 -2.070
stars_tst_3|t2 -1.286 0.028 -45.292 0.000 -1.286 -1.286
stars_tst_3|t3 -0.617 0.022 -27.698 0.000 -0.617 -0.617
stars_tst_3|t4 0.191 0.021 9.125 0.000 0.191 0.191
stars_tst_6|t1 -1.817 0.040 -45.915 0.000 -1.817 -1.817
stars_tst_6|t2 -1.043 0.025 -40.971 0.000 -1.043 -1.043
stars_tst_6|t3 -0.430 0.021 -20.000 0.000 -0.430 -0.430
stars_tst_6|t4 0.327 0.021 15.432 0.000 0.327 0.327
stars_tst_4|t1 -1.817 0.040 -45.915 0.000 -1.817 -1.817
stars_tst_4|t2 -1.105 0.026 -42.323 0.000 -1.105 -1.105
stars_tst_4|t3 -0.521 0.022 -23.852 0.000 -0.521 -0.521
stars_tst_4|t4 0.238 0.021 11.340 0.000 0.238 0.238
stars_tst_1|t1 -1.513 0.032 -46.981 0.000 -1.513 -1.513
stars_tst_1|t2 -0.696 0.023 -30.633 0.000 -0.696 -0.696
stars_tst_1|t3 -0.009 0.021 -0.414 0.679 -0.009 -0.009
stars_tst_1|t4 0.789 0.023 33.859 0.000 0.789 0.789
stars_tst_8|t1 -1.210 0.027 -44.215 0.000 -1.210 -1.210
stars_tst_8|t2 -0.465 0.022 -21.505 0.000 -0.465 -0.465
stars_tst_8|t3 0.149 0.021 7.139 0.000 0.149 0.149
stars_tst_8|t4 0.882 0.024 36.751 0.000 0.882 0.882
stars_hlp_2|t1 -0.949 0.025 -38.636 0.000 -0.949 -0.949
stars_hlp_2|t2 -0.195 0.021 -9.323 0.000 -0.195 -0.195
stars_hlp_2|t3 0.365 0.021 17.144 0.000 0.365 0.365
stars_hlp_2|t4 1.069 0.026 41.563 0.000 1.069 1.069
stars_hlp_3|t1 -0.981 0.025 -39.468 0.000 -0.981 -0.981
stars_hlp_3|t2 -0.184 0.021 -8.794 0.000 -0.184 -0.184
stars_hlp_3|t3 0.409 0.021 19.115 0.000 0.409 0.409
stars_hlp_3|t4 1.144 0.027 43.079 0.000 1.144 1.144
stars_hlp_4|t1 -0.773 0.023 -33.332 0.000 -0.773 -0.773
stars_hlp_4|t2 0.075 0.021 3.629 0.000 0.075 0.075
stars_hlp_4|t3 0.692 0.023 30.506 0.000 0.692 0.692
stars_hlp_4|t4 1.407 0.030 46.470 0.000 1.407 1.407
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.301 0.301 0.301
.stars_int_11 0.261 0.261 0.261
.stars_int_1 0.280 0.280 0.280
.stars_int_6 0.361 0.361 0.361
.stars_int_5 0.628 0.628 0.628
.stars_int_2 0.397 0.397 0.397
.stars_test_3 0.213 0.213 0.213
.stars_test_6 0.311 0.311 0.311
.stars_test_4 0.309 0.309 0.309
.stars_test_1 0.275 0.275 0.275
.stars_test_8 0.532 0.532 0.532
.stars_help_2 0.272 0.272 0.272
.stars_help_3 0.180 0.180 0.180
.stars_help_4 0.262 0.262 0.262
interpret 0.699 0.009 80.572 0.000 1.000 1.000
test 0.787 0.009 85.827 0.000 1.000 1.000
help 0.728 0.010 74.849 0.000 1.000 1.000
🤔 Are the relationships between items and their underlying factors equivalent across groups?
✔️ The metric invariance model shows acceptable fit, with only minimal changes relative to the configural model. This suggests that factor loadings can be considered equal across groups, meaning the constructs are measured in the same way in both groups.
Specifically:
The model ran properly (“lavaan ended normally after 42 iterations”)
Equality constraints were successfully applied (11 loadings constrained equal across groups)
Model fit remains very similar - changes are tiny and well within recommended thresholds (more on this later). In other words, adding equality constraints on loadings did not meaningfully worsen model fit, supporting metric invariance.
Robust CFI ≈ .951 (previously ≈ .952)
Robust TLI ≈ .944 (previously ≈ .941)
Robust RMSEA ≈ .078 (previously ≈ .080)
SRMR ≈ .046 (previously ≈ .045)
Group specific χ² showed different levels of misfit across groups (Non-English: χ² ≈ 1779; English: χ² ≈ 1323), but nothing dramatic or alarming (or new)
Factor loadings are now constrained equal across groups so we don’t need to interpret them individually (we know they’ll be the same). Instead, we’re interpreting the model fit to see whether forcing equal loadings reduces model fit substantially (if it does, it suggests the loadings aren’t equal).
Factor correlations, thresholds, and residual variances are all interpreted as before.
Imposing equality constraints on factor loadings does not meaningfully reduce model fit. This supports metric invariance, indicating that the relationship between items and the latent construct is equivalent across groups.
At this stage, relationships involving the latent variables (e.g., correlations or regressions) can be meaningfully compared across groups, as the construct is measured on the same scale. However, comparisons of latent means are not yet appropriate.
Step 3: Scalar Invariance
The key question at this stage is whether group differences in observed scores reflect true differences in the underlying constructs.
Scalar invariance tests whether item thresholds (for ordinal data) can be constrained to be equal across groups, in addition to factor loadings. This ensures that individuals with the same level of the latent trait have the same expected responses, regardless of group membership.
To evaluate this, we fit a model in which both loadings and thresholds are constrained to equality across groups, and compare it to the metric model using changes in fit indices.
If model fit remains acceptable, this supports scalar invariance and allows for meaningful comparisons of latent means across groups. However, if model fit deteriorates, this indicates that some items may be systematically easier or harder to endorse in one group than another. In such cases, partial invariance may be considered by freeing selected thresholds.
fit_scalar <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
group.equal = c("loadings", "thresholds"),
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_scalar, fit.measures = TRUE, standardized = TRUE)
The group.equal argument constrains the loadings and thresholds to be equal across groups. Otherwise our code is as per the metric model.
lavaan 0.6-19 ended normally after 61 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 163
Number of equality constraints 67
Number of observations per group: Used Total
Non-English 3227 3236
English 3641 3649
Model Test User Model:
Standard Scaled
Test Statistic 2778.649 4190.465
Degrees of freedom 198 198
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.671
Shift parameter 46.526
simple second-order correction
Test statistic for each group:
Non-English 2376.224 2376.224
English 1814.242 1814.242
Model Test Baseline Model:
Test statistic 377816.121 138627.992
Degrees of freedom 182 182
P-value 0.000 0.000
Scaling correction factor 2.728
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.993 0.971
Tucker-Lewis Index (TLI) 0.994 0.973
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.062 0.077
90 Percent confidence interval - lower 0.060 0.075
90 Percent confidence interval - upper 0.064 0.079
P-value H_0: RMSEA <= 0.050 0.000 0.000
P-value H_0: RMSEA >= 0.080 0.000 0.003
Robust RMSEA NA
90 Percent confidence interval - lower NA
90 Percent confidence interval - upper NA
P-value H_0: Robust RMSEA <= 0.050 NA
P-value H_0: Robust RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.046 0.046
Parameter Estimates:
Parameterization Delta
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Group 1 [Non-English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.833 0.833
str__11 (.p2.) 1.048 0.009 115.789 0.000 0.873 0.873
strs__1 (.p3.) 1.021 0.009 117.498 0.000 0.850 0.850
strs__6 (.p4.) 0.967 0.010 97.551 0.000 0.805 0.805
strs__5 (.p5.) 0.768 0.014 55.203 0.000 0.639 0.639
strs__2 (.p6.) 0.913 0.011 83.891 0.000 0.760 0.760
test =~
strs__3 1.000 0.898 0.898
strs__6 (.p8.) 0.908 0.009 100.492 0.000 0.815 0.815
strs__4 (.p9.) 0.916 0.009 99.114 0.000 0.823 0.823
strs__1 (.10.) 0.947 0.009 106.834 0.000 0.850 0.850
strs__8 (.11.) 0.766 0.012 62.205 0.000 0.688 0.688
help =~
strs__2 1.000 0.823 0.823
strs__3 (.13.) 1.066 0.013 84.701 0.000 0.877 0.877
strs__4 (.14.) 1.019 0.012 83.240 0.000 0.839 0.839
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.487 0.011 43.455 0.000 0.651 0.651
help 0.435 0.011 38.223 0.000 0.635 0.635
test ~~
help 0.441 0.012 36.615 0.000 0.597 0.597
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
st__3|1 (.15.) -0.425 0.020 -20.943 0.000 -0.425 -0.425
st__3|2 (.16.) 0.324 0.019 17.054 0.000 0.324 0.324
st__3|3 (.17.) 1.009 0.022 45.028 0.000 1.009 1.009
st__3|4 (.18.) 1.747 0.032 54.222 0.000 1.747 1.747
s__11|1 (.19.) -0.753 0.023 -33.039 0.000 -0.753 -0.753
s__11|2 (.20.) 0.034 0.019 1.735 0.083 0.034 0.034
s__11|3 (.21.) 0.706 0.021 33.772 0.000 0.706 0.706
s__11|4 (.22.) 1.463 0.028 52.345 0.000 1.463 1.463
st__1|1 (.23.) -0.710 0.022 -32.070 0.000 -0.710 -0.710
st__1|2 (.24.) 0.067 0.019 3.499 0.000 0.067 0.067
st__1|3 (.25.) 0.759 0.021 36.334 0.000 0.759 0.759
st__1|4 (.26.) 1.503 0.028 53.400 0.000 1.503 1.503
st__6|1 (.27.) -0.747 0.022 -33.329 0.000 -0.747 -0.747
st__6|2 (.28.) 0.061 0.019 3.271 0.001 0.061 0.061
st__6|3 (.29.) 0.759 0.020 37.360 0.000 0.759 0.759
st__6|4 (.30.) 1.528 0.028 54.390 0.000 1.528 1.528
st__5|1 (.31.) 0.128 0.018 7.080 0.000 0.128 0.128
st__5|2 (.32.) 0.793 0.020 38.702 0.000 0.793 0.793
st__5|3 (.33.) 1.383 0.027 51.077 0.000 1.383 1.383
st__5|4 (.34.) 1.990 0.038 52.030 0.000 1.990 1.990
st__2|1 (.35.) -0.788 0.022 -35.702 0.000 -0.788 -0.788
st__2|2 (.36.) 0.011 0.018 0.630 0.529 0.011 0.011
st__2|3 (.37.) 0.726 0.020 36.839 0.000 0.726 0.726
st__2|4 (.38.) 1.531 0.029 53.243 0.000 1.531 1.531
st__3|1 (.39.) -1.829 0.038 -48.063 0.000 -1.829 -1.829
st__3|2 (.40.) -1.079 0.025 -42.490 0.000 -1.079 -1.079
st__3|3 (.41.) -0.445 0.021 -21.710 0.000 -0.445 -0.445
st__3|4 (.42.) 0.322 0.020 16.253 0.000 0.322 0.322
st__6|1 (.43.) -1.474 0.029 -50.201 0.000 -1.474 -1.474
st__6|2 (.44.) -0.748 0.021 -35.228 0.000 -0.748 -0.748
st__6|3 (.45.) -0.188 0.019 -10.056 0.000 -0.188 -0.188
st__6|4 (.46.) 0.500 0.020 25.476 0.000 0.500 0.500
st__4|1 (.47.) -1.488 0.029 -50.691 0.000 -1.488 -1.488
st__4|2 (.48.) -0.816 0.022 -37.532 0.000 -0.816 -0.816
st__4|3 (.49.) -0.265 0.019 -13.978 0.000 -0.265 -0.265
st__4|4 (.50.) 0.456 0.020 23.206 0.000 0.456 0.456
st__1|1 (.51.) -1.206 0.026 -46.280 0.000 -1.206 -1.206
st__1|2 (.52.) -0.458 0.020 -22.911 0.000 -0.458 -0.458
st__1|3 (.53.) 0.205 0.019 10.721 0.000 0.205 0.205
st__1|4 (.54.) 0.978 0.024 41.407 0.000 0.978 0.978
st__8|1 (.55.) -0.879 0.022 -40.085 0.000 -0.879 -0.879
st__8|2 (.56.) -0.229 0.018 -12.681 0.000 -0.229 -0.229
st__8|3 (.57.) 0.333 0.018 18.408 0.000 0.333 0.333
st__8|4 (.58.) 1.018 0.023 44.542 0.000 1.018 1.018
st__2|1 (.59.) -0.699 0.023 -30.946 0.000 -0.699 -0.699
st__2|2 (.60.) 0.062 0.019 3.240 0.001 0.062 0.062
st__2|3 (.61.) 0.649 0.020 32.210 0.000 0.649 0.649
st__2|4 (.62.) 1.374 0.026 52.251 0.000 1.374 1.374
st__3|1 (.63.) -0.607 0.022 -27.189 0.000 -0.607 -0.607
st__3|2 (.64.) 0.174 0.020 8.867 0.000 0.174 0.174
st__3|3 (.65.) 0.792 0.022 36.603 0.000 0.792 0.792
st__3|4 (.66.) 1.556 0.030 51.553 0.000 1.556 1.556
st__4|1 (.67.) -0.334 0.020 -16.368 0.000 -0.334 -0.334
st__4|2 (.68.) 0.496 0.020 24.852 0.000 0.496 0.496
st__4|3 (.69.) 1.124 0.024 46.034 0.000 1.124 1.124
st__4|4 (.70.) 1.846 0.035 52.327 0.000 1.846 1.846
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.306 0.306 0.306
.stars_int_11 0.238 0.238 0.238
.stars_int_1 0.277 0.277 0.277
.stars_int_6 0.351 0.351 0.351
.stars_int_5 0.591 0.591 0.591
.stars_int_2 0.422 0.422 0.422
.stars_test_3 0.194 0.194 0.194
.stars_test_6 0.336 0.336 0.336
.stars_test_4 0.323 0.323 0.323
.stars_test_1 0.277 0.277 0.277
.stars_test_8 0.527 0.527 0.527
.stars_help_2 0.322 0.322 0.322
.stars_help_3 0.230 0.230 0.230
.stars_help_4 0.297 0.297 0.297
interpret 0.694 0.011 65.542 0.000 1.000 1.000
test 0.806 0.010 78.047 0.000 1.000 1.000
help 0.678 0.012 55.924 0.000 1.000 1.000
Group 2 [English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.823 0.842
str__11 (.p2.) 1.048 0.009 115.789 0.000 0.862 0.852
strs__1 (.p3.) 1.021 0.009 117.498 0.000 0.840 0.848
strs__6 (.p4.) 0.967 0.010 97.551 0.000 0.795 0.795
strs__5 (.p5.) 0.768 0.014 55.203 0.000 0.631 0.593
strs__2 (.p6.) 0.913 0.011 83.891 0.000 0.751 0.792
test =~
strs__3 1.000 0.850 0.872
strs__6 (.p8.) 0.908 0.009 100.492 0.000 0.772 0.841
strs__4 (.p9.) 0.916 0.009 99.114 0.000 0.779 0.837
strs__1 (.10.) 0.947 0.009 106.834 0.000 0.805 0.851
strs__8 (.11.) 0.766 0.012 62.205 0.000 0.651 0.683
help =~
strs__2 1.000 0.875 0.848
strs__3 (.13.) 1.066 0.013 84.701 0.000 0.932 0.907
strs__4 (.14.) 1.019 0.012 83.240 0.000 0.891 0.862
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.463 0.019 24.458 0.000 0.662 0.662
help 0.444 0.019 23.282 0.000 0.618 0.618
test ~~
help 0.449 0.020 22.958 0.000 0.604 0.604
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret 0.258 0.021 12.124 0.000 0.313 0.313
test 0.222 0.023 9.631 0.000 0.261 0.261
help 0.347 0.023 15.377 0.000 0.396 0.396
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
st__3|1 (.15.) -0.425 0.020 -20.943 0.000 -0.425 -0.435
st__3|2 (.16.) 0.324 0.019 17.054 0.000 0.324 0.331
st__3|3 (.17.) 1.009 0.022 45.028 0.000 1.009 1.032
st__3|4 (.18.) 1.747 0.032 54.222 0.000 1.747 1.787
s__11|1 (.19.) -0.753 0.023 -33.039 0.000 -0.753 -0.744
s__11|2 (.20.) 0.034 0.019 1.735 0.083 0.034 0.033
s__11|3 (.21.) 0.706 0.021 33.772 0.000 0.706 0.698
s__11|4 (.22.) 1.463 0.028 52.345 0.000 1.463 1.446
st__1|1 (.23.) -0.710 0.022 -32.070 0.000 -0.710 -0.717
st__1|2 (.24.) 0.067 0.019 3.499 0.000 0.067 0.068
st__1|3 (.25.) 0.759 0.021 36.334 0.000 0.759 0.767
st__1|4 (.26.) 1.503 0.028 53.400 0.000 1.503 1.519
st__6|1 (.27.) -0.747 0.022 -33.329 0.000 -0.747 -0.746
st__6|2 (.28.) 0.061 0.019 3.271 0.001 0.061 0.061
st__6|3 (.29.) 0.759 0.020 37.360 0.000 0.759 0.759
st__6|4 (.30.) 1.528 0.028 54.390 0.000 1.528 1.528
st__5|1 (.31.) 0.128 0.018 7.080 0.000 0.128 0.120
st__5|2 (.32.) 0.793 0.020 38.702 0.000 0.793 0.744
st__5|3 (.33.) 1.383 0.027 51.077 0.000 1.383 1.298
st__5|4 (.34.) 1.990 0.038 52.030 0.000 1.990 1.868
st__2|1 (.35.) -0.788 0.022 -35.702 0.000 -0.788 -0.832
st__2|2 (.36.) 0.011 0.018 0.630 0.529 0.011 0.012
st__2|3 (.37.) 0.726 0.020 36.839 0.000 0.726 0.766
st__2|4 (.38.) 1.531 0.029 53.243 0.000 1.531 1.616
st__3|1 (.39.) -1.829 0.038 -48.063 0.000 -1.829 -1.877
st__3|2 (.40.) -1.079 0.025 -42.490 0.000 -1.079 -1.107
st__3|3 (.41.) -0.445 0.021 -21.710 0.000 -0.445 -0.457
st__3|4 (.42.) 0.322 0.020 16.253 0.000 0.322 0.330
st__6|1 (.43.) -1.474 0.029 -50.201 0.000 -1.474 -1.605
st__6|2 (.44.) -0.748 0.021 -35.228 0.000 -0.748 -0.815
st__6|3 (.45.) -0.188 0.019 -10.056 0.000 -0.188 -0.205
st__6|4 (.46.) 0.500 0.020 25.476 0.000 0.500 0.544
st__4|1 (.47.) -1.488 0.029 -50.691 0.000 -1.488 -1.598
st__4|2 (.48.) -0.816 0.022 -37.532 0.000 -0.816 -0.876
st__4|3 (.49.) -0.265 0.019 -13.978 0.000 -0.265 -0.285
st__4|4 (.50.) 0.456 0.020 23.206 0.000 0.456 0.490
st__1|1 (.51.) -1.206 0.026 -46.280 0.000 -1.206 -1.275
st__1|2 (.52.) -0.458 0.020 -22.911 0.000 -0.458 -0.484
st__1|3 (.53.) 0.205 0.019 10.721 0.000 0.205 0.217
st__1|4 (.54.) 0.978 0.024 41.407 0.000 0.978 1.033
st__8|1 (.55.) -0.879 0.022 -40.085 0.000 -0.879 -0.921
st__8|2 (.56.) -0.229 0.018 -12.681 0.000 -0.229 -0.240
st__8|3 (.57.) 0.333 0.018 18.408 0.000 0.333 0.349
st__8|4 (.58.) 1.018 0.023 44.542 0.000 1.018 1.067
st__2|1 (.59.) -0.699 0.023 -30.946 0.000 -0.699 -0.678
st__2|2 (.60.) 0.062 0.019 3.240 0.001 0.062 0.061
st__2|3 (.61.) 0.649 0.020 32.210 0.000 0.649 0.630
st__2|4 (.62.) 1.374 0.026 52.251 0.000 1.374 1.332
st__3|1 (.63.) -0.607 0.022 -27.189 0.000 -0.607 -0.590
st__3|2 (.64.) 0.174 0.020 8.867 0.000 0.174 0.170
st__3|3 (.65.) 0.792 0.022 36.603 0.000 0.792 0.770
st__3|4 (.66.) 1.556 0.030 51.553 0.000 1.556 1.514
st__4|1 (.67.) -0.334 0.020 -16.368 0.000 -0.334 -0.323
st__4|2 (.68.) 0.496 0.020 24.852 0.000 0.496 0.479
st__4|3 (.69.) 1.124 0.024 46.034 0.000 1.124 1.088
st__4|4 (.70.) 1.846 0.035 52.327 0.000 1.846 1.786
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.279 0.279 0.292
.stars_int_11 0.281 0.281 0.274
.stars_int_1 0.275 0.275 0.281
.stars_int_6 0.368 0.368 0.368
.stars_int_5 0.736 0.736 0.649
.stars_int_2 0.334 0.334 0.372
.stars_test_3 0.227 0.227 0.239
.stars_test_6 0.247 0.247 0.293
.stars_test_4 0.260 0.260 0.300
.stars_test_1 0.247 0.247 0.276
.stars_test_8 0.486 0.486 0.534
.stars_help_2 0.298 0.298 0.280
.stars_help_3 0.188 0.188 0.177
.stars_help_4 0.275 0.275 0.257
interpret 0.677 0.027 25.171 0.000 1.000 1.000
test 0.723 0.030 24.033 0.000 1.000 1.000
help 0.765 0.034 22.702 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3 1.023 0.018 55.503 0.000 1.023 1.000
stars_int_11 0.988 0.018 54.144 0.000 0.988 1.000
stars_int_1 1.010 0.018 55.097 0.000 1.010 1.000
stars_int_6 1.000 0.018 56.019 0.000 1.000 1.000
stars_int_5 0.939 0.019 49.275 0.000 0.939 1.000
stars_int_2 1.055 0.019 55.640 0.000 1.055 1.000
stars_test_3 1.026 0.020 52.186 0.000 1.026 1.000
stars_test_6 1.089 0.020 54.743 0.000 1.089 1.000
stars_test_4 1.074 0.020 54.476 0.000 1.074 1.000
stars_test_1 1.057 0.020 53.761 0.000 1.057 1.000
stars_test_8 1.048 0.021 50.480 0.000 1.048 1.000
stars_help_2 0.970 0.019 50.422 0.000 0.970 1.000
stars_help_3 0.973 0.020 47.804 0.000 0.973 1.000
stars_help_4 0.967 0.019 49.755 0.000 0.967 1.000
🤔 Do group differences in observed scores reflect true differences in the underlying constructs?
✔️ The scalar invariance model shows acceptable (though slightly reduced) fit, with only small changes relative to the metric model. This suggests that thresholds can be considered approximately equal across groups, allowing for meaningful comparisons of latent means.
Specifically:
The model ran properly (“lavaan ended normally after 61 iterations”)
Equality constraints were successfully applied (67 constraints: loadings + thresholds constrained equal across groups)
Model fit (compared to metric) remains very similar overall, with only small deterioration. In other words, adding equality constraints on thresholds did not meaningfully worsen model fit, supporting scalar invariance.
Scaled CFI ≈ .971
Scaled TLI ≈ .973
Scaled RMSEA ≈ .077
SRMR ≈ .046 (unchanged from the metric model)
(Note that the robust indices are reported as NA in this output, so we read the scaled column instead.)
Group specific χ² showed different levels of misfit across groups (Non-English: χ² ≈ 2376; English: χ² ≈ 1814), but nothing dramatic or alarming (or new)
Factor loadings remain constrained equal across groups (carried over from the metric model), so we don’t interpret them individually; we’re again interpreting the model fit to see whether the constraints reduce it substantially.
Thresholds are now also constrained to be equal across groups and, again, aren’t interpreted individually. Rather, the fact that model fit remains acceptable suggests that response category boundaries operate similarly across groups.
Factor correlations and residual variances are interpreted as before.
Imposing equality constraints on both loadings and thresholds does not meaningfully reduce model fit. This supports scalar invariance, indicating that individuals with the same level of the latent construct respond to items similarly across groups.
At this stage, differences in observed scores can be interpreted as reflecting true differences in the underlying constructs, allowing for meaningful comparisons of latent means across groups.
Step 4: Compare Models
The key question at this stage is whether adding equality constraints leads to a meaningful deterioration in model fit.
Measurement invariance is evaluated by comparing increasingly constrained models (configural, metric, scalar, and strict) to determine whether the added constraints are tenable. Rather than focusing on absolute fit, the emphasis is on changes in model fit between successive models.
To evaluate this, we compare models using changes in fit indices such as CFI and RMSEA. Small changes in these indices suggest that the constraints do not substantially worsen model fit, supporting invariance at that level. As a general guideline, a decrease in CFI of less than .01 and an increase in RMSEA of less than .015 are often taken as evidence that invariance holds.
In addition to overall fit indices, it is also important to consider parameter changes and the substantive meaning of the constraints. For example, if imposing equality constraints leads to noticeable shifts in loadings or thresholds, this may indicate that items function differently across groups, even if global fit indices appear acceptable.
If model fit remains stable as constraints are added, this supports the conclusion that the measurement model operates equivalently across groups at that level of invariance. However, if model fit deteriorates substantially, this suggests that the assumption of invariance is violated. In such cases, it may be appropriate to explore partial invariance by relaxing specific constraints rather than abandoning the analysis entirely.
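If you’d rather compute these changes directly than eyeball successive outputs, here is a minimal sketch using the three models fitted above:
fit_list <- list(
  configural = fit_configural,
  metric = fit_metric,
  scalar = fit_scalar
)
# One row of fit indices per model, in order of increasing constraint
fit_tab <- fit_list |>
  purrr::map_dfr(
    ~ lavaan::fitMeasures(.x, c("cfi", "tli", "rmsea", "srmr")),
    .id = "model"
  )
# Change relative to the previous (less constrained) model
fit_tab |>
  dplyr::mutate(
    delta_cfi = cfi - dplyr::lag(cfi),
    delta_rmsea = rmsea - dplyr::lag(rmsea)
  )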
lavaan::anova(fit_configural, fit_metric, fit_scalar)
Scaled Chi-Squared Difference Test (method = "satorra.2000")
lavaan->lavTestLRT():
lavaan NOTE: The "Chisq" column contains standard test statistics, not the
robust test that should be reported per model. A robust difference test is
a function of two standard (not robust) statistics.
Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
fit_configural 148 2312.0
fit_metric 159 2396.2 47.22 11 1.969e-06 ***
fit_scalar 198 2778.6 627.87 39 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lavaan::fitMeasures(fit_configural, c("cfi", "tli", "rmsea", "srmr"))
cfi tli rmsea srmr
0.994 0.993 0.065 0.045
lavaan::fitMeasures(fit_metric, c("cfi", "tli", "rmsea", "srmr"))
cfi tli rmsea srmr
0.994 0.993 0.064 0.046
lavaan::fitMeasures(fit_scalar, c("cfi", "tli", "rmsea", "srmr"))
cfi tli rmsea srmr
0.993 0.994 0.062 0.046
🤔 Does adding equality constraints lead to a meaningful deterioration in model fit?
Overall Interpretation
✔️ Model comparisons provide strong support for scalar invariance, despite statistically significant chi-square difference tests.
Specifically:
The scaled chi-square difference test comparing the configural and metric models is statistically significant (Δχ² ≈ 47.22, df = 11, p < .001), which would traditionally suggest that constraining loadings worsens model fit.
However, this needs to be interpreted with caution:
Chi-square tests are highly sensitive to large sample sizes (and ours is very large)
Even trivial differences can become statistically significant
So we turn to practical fit indices instead.
Fit index comparison:
Model fit remains extremely stable across all models:
CFI: .994 → .994 → .993
TLI: .993 → .993 → .994
RMSEA: .065 → .064 → .062
SRMR: .045 → .046 → .046
Changes are tiny and well within recommended thresholds (e.g., ΔCFI < .01)
In fact, some indices even improve slightly as constraints are added (which is always a bit suspicious, but not uncommon with WLSMV).
Metric Invariance
Although the chi-square difference test is significant:
Changes in CFI, TLI, RMSEA, and SRMR are negligible
There is no meaningful deterioration in model fit
✔️ This supports metric invariance, meaning factor loadings can be treated as equal across groups.
This indicates that:
The constructs are measured in the same way across groups (metric invariance)
Scalar Invariance
The comparison between the metric and scalar models is also statistically significant (Δχ² ≈ 627.87, df = 39, p < .001).
- As with the metric comparison, this reflects the chi-square test’s sensitivity at this sample size rather than substantive misfit
Combined with:
- Essentially unchanged fit indices
✔️ This supports scalar invariance
Imposing equality constraints on loadings and thresholds does not meaningfully reduce model fit.
This indicates that:
Individuals with the same level of the latent construct respond similarly to items across groups (scalar invariance)
👉 Taken together, differences in observed scores can be interpreted as reflecting true differences in the underlying constructs, allowing for meaningful comparisons of latent means across groups.
Sometimes, invariance testing doesn’t behave nicely. You add constraints and suddenly model fit drops, chi-square screams, and your neat cross-group comparison starts looking questionable.
Full invariance assumes that all parameters (e.g., loadings or thresholds) are identical across groups. That’s often unrealistic in practice.
A small number of items frequently behave differently across groups — e.g., due to wording, cultural interpretation, or just statistical noise.
Partial invariance allows some parameters to vary, while keeping most constraints in place.
The key question at this stage is whether a small number of non-invariant parameters can be accommodated without undermining the overall comparability of the model.
To evaluate this, we examine model comparisons and, if invariance is not supported, inspect score tests to identify which constraints are contributing most to model misfit. We then free these parameters one at a time and re-estimate the model.
If model fit improves and the majority of parameters remain invariant, partial invariance may be considered acceptable. This allows meaningful comparisons to proceed, provided that enough parameters are still constrained to anchor the scale.
However, if too many parameters need to be freed, or if (theoretically) key items show non-invariance, this suggests that the construct may not be operating equivalently across groups. In such cases, comparisons should be interpreted with caution or reconsidered altogether.
Partial invariance is not a workaround for a poor model, but a pragmatic recognition that small differences between groups are often unavoidable.
How many freed parameters is too many?
Partial invariance is acceptable when most parameters remain equal across groups and each factor retains at least two invariant items. However, if many parameters must be freed or non-invariance is concentrated within a factor, this suggests the construct may not be comparable across groups.
Invariance testing is not about finding a “perfect” model, but about determining how much equality can be reasonably assumed across groups without distorting the measurement of the construct.
How to Identify Problems
Score tests evaluate whether the equality constraints imposed in an invariance model are tenable.
Specifically, they test:
Would freeing each constrained parameter (e.g., a loading or threshold) significantly improve model fit?
Each test considers one constraint at a time. A significant result suggests that the parameter may differ across groups and that the equality constraint is contributing to model misfit.
lavaan::lavTestScore(fit_scalar)
Warning: lavaan->lavTestScore():
se is not `standard'; not implemented yet; falling back to ordinary score test
$test
total score test:
test X2 df p.value
1 score 466.451 67 0
$uni
univariate score tests:
lhs op rhs X2 df p.value
1 .p2. == .p123. 1.378 1 0.240
2 .p3. == .p124. 5.197 1 0.023
3 .p4. == .p125. 0.758 1 0.384
4 .p5. == .p126. 9.861 1 0.002
5 .p6. == .p127. 22.041 1 0.000
6 .p8. == .p129. 0.043 1 0.836
7 .p9. == .p130. 0.146 1 0.702
8 .p10. == .p131. 0.857 1 0.354
9 .p11. == .p132. 16.203 1 0.000
10 .p13. == .p134. 0.030 1 0.862
11 .p14. == .p135. 29.340 1 0.000
12 .p15. == .p136. 14.342 1 0.000
13 .p16. == .p137. 0.044 1 0.834
14 .p17. == .p138. 5.347 1 0.021
15 .p18. == .p139. 0.370 1 0.543
16 .p19. == .p140. 9.210 1 0.002
17 .p20. == .p141. 0.103 1 0.748
18 .p21. == .p142. 0.043 1 0.836
19 .p22. == .p143. 0.281 1 0.596
20 .p23. == .p144. 0.043 1 0.836
21 .p24. == .p145. 13.566 1 0.000
22 .p25. == .p146. 4.003 1 0.045
23 .p26. == .p147. 1.149 1 0.284
24 .p27. == .p148. 2.183 1 0.140
25 .p28. == .p149. 13.607 1 0.000
26 .p29. == .p150. 6.364 1 0.012
27 .p30. == .p151. 0.000 1 0.995
28 .p31. == .p152. 52.688 1 0.000
29 .p32. == .p153. 22.579 1 0.000
30 .p33. == .p154. 1.487 1 0.223
31 .p34. == .p155. 2.781 1 0.095
32 .p35. == .p156. 2.695 1 0.101
33 .p36. == .p157. 5.798 1 0.016
34 .p37. == .p158. 6.868 1 0.009
35 .p38. == .p159. 8.913 1 0.003
36 .p39. == .p160. 1.113 1 0.291
37 .p40. == .p161. 6.990 1 0.008
38 .p41. == .p162. 21.957 1 0.000
39 .p42. == .p163. 44.822 1 0.000
40 .p43. == .p164. 0.087 1 0.768
41 .p44. == .p165. 0.240 1 0.624
42 .p45. == .p166. 0.134 1 0.714
43 .p46. == .p167. 0.031 1 0.861
44 .p47. == .p168. 0.000 1 0.986
45 .p48. == .p169. 0.344 1 0.558
46 .p49. == .p170. 1.464 1 0.226
47 .p50. == .p171. 6.824 1 0.009
48 .p51. == .p172. 0.577 1 0.448
49 .p52. == .p173. 0.506 1 0.477
50 .p53. == .p174. 0.049 1 0.825
51 .p54. == .p175. 2.781 1 0.095
52 .p55. == .p176. 36.565 1 0.000
53 .p56. == .p177. 10.387 1 0.001
54 .p57. == .p178. 2.632 1 0.105
55 .p58. == .p179. 0.265 1 0.607
56 .p59. == .p180. 19.348 1 0.000
57 .p60. == .p181. 34.664 1 0.000
58 .p61. == .p182. 27.500 1 0.000
59 .p62. == .p183. 24.423 1 0.000
60 .p63. == .p184. 4.084 1 0.043
61 .p64. == .p185. 0.179 1 0.672
62 .p65. == .p186. 0.018 1 0.893
63 .p66. == .p187. 0.594 1 0.441
64 .p67. == .p188. 51.585 1 0.000
65 .p68. == .p189. 22.176 1 0.000
66 .p69. == .p190. 17.744 1 0.000
67 .p70. == .p191. 5.976 1 0.014
What does the overall test mean?
The overall score test evaluates all constraints simultaneously.
In this case: X²(67) = 466.45, p < .001
This tests the null hypothesis that all constrained parameters are equal across groups.
Since the test is significant, we reject this hypothesis, indicating that at least some equality constraints are violated.
Interpreting this in Practice (especially with large samples):
With large samples (as here), score tests are highly sensitive. This means they can detect:
very small differences between groups
even when those differences have negligible impact on overall model fit
So, a significant result does not necessarily mean that invariance has meaningfully failed.
Instead, it often reflects minor deviations from perfect equality rather than substantively important differences.
The individual score tests allow us to explore that further.
What do the individual score tests mean?
The output from lavTestScore() also provides a separate test for each equality constraint in the model.
Each row tests: Does this specific parameter differ across groups?
A significant result indicates that freeing that parameter would improve model fit, suggesting potential non-invariance.
Step 1: Focus on Effect Size, Not Just p-values
With large samples (which most MI samples are!), many parameters will be statistically significant even if differences are trivial.
Instead of focusing only on p-values, prioritise:
Large score test statistics (X² values)
Rough guide:
~3–5 → small
~5–10 → moderate
>10 → potentially meaningful
Consistency across parameters — i.e., issues cluster within a factor or the same item appearing repeatedly
Substantive plausibility — does it make theoretical sense that this item behaves differently?
Step 2: Which Parameters Should You Free?
Only consider freeing parameters that are:
already in the model (e.g., loadings =~, thresholds |)
strongly supported by the score tests (large X²)
part of a clear, interpretable pattern, not isolated noise
Avoid:
freeing many parameters at once (free one at a time and check the impact before freeing another)
freeing parameters purely based on p < .05
Do not chase every significant constraint — focus on the few that meaningfully affect the model.
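One way to keep that discipline (a minimal sketch, using fit_scalar from above) is to pull the univariate score tests into a data frame and rank them by the size of the test statistic rather than by p-value:
score_tests <- lavaan::lavTestScore(fit_scalar)$uni
# Largest potential violations first; p-values are shown but not used for ordering
score_tests |>
  dplyr::arrange(dplyr::desc(X2)) |>
  head(10)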
What do the individual tests for our model show?
Most parameters:
have very small X² values
are non-significant
or fall in the trivial range
However, a number of constraints stand out (the largest violations, X² > 20):
.p31. → 52.69
.p67. → 51.59
.p42. → 44.82
.p55. → 36.57
.p60. → 34.66
.p14. → 29.34
.p61. → 27.50
.p62. → 24.42
.p32. → 22.58
.p68. → 22.18
.p6. → 22.04
.p41. → 21.96
These suggest some non-invariance, but importantly:
they are a minority of the 67 constraints tested
they are not obviously concentrated within a single factor or item (pending mapping)
Step 2: Map Parameter Labels to Model Terms
lavaan reports constraints using internal labels (e.g., .p6. == .p127.), which correspond to the same parameter across groups.
To identify the actual parameter (using .p6. as an example):
params <- lavaan::parameterEstimates(fit_scalar)
params |>
dplyr::select(lhs, op, rhs, label) |>
dplyr::filter(label == ".p6.")
lhs op rhs label
1 interpret =~ stars_int_2 .p6.
2 interpret =~ stars_int_2 .p6.
This tells us that .p6. represents the parameter: interpret =~ stars_int_2, aka the loading of stars_int_2 on the interpretation anxiety factor.
It would be worth mapping the other parameters to see whether there are consistent problems in a given factor, but otherwise, there wouldn’t be a strong case for removing any items here.
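A quick way to do that mapping in bulk is sketched below; the labels are the larger ones flagged above, and dplyr::distinct() collapses the per-group duplicates we saw with .p6.:
flagged <- c(".p31.", ".p67.", ".p42.", ".p55.", ".p14.", ".p6.")
lavaan::parameterEstimates(fit_scalar) |>
  dplyr::filter(label %in% flagged) |>
  dplyr::distinct(lhs, op, rhs, label)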
Putting this in Context
Crucially, earlier model comparisons showed:
negligible changes in CFI, RMSEA, and SRMR
χ² difference tests that, although statistically significant, are expected to be so at this sample size
This means:
Constraining these parameters does not meaningfully worsen model fit
What if there was an issue with a particular item?
In practice, it is common for a small number of parameters to show evidence of non-invariance. The key question is:
Are these differences large enough to matter?
If only a small number of parameters are problematic, we can allow for partial invariance by freeing those specific constraints.
How to Implement Partial Invariance
To be clear, we don’t really need to do this, but if we did want to allow partial invariance, this is how you’d do it.
Free only the clearly problematic parameter(s):
fit_partial <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
group.equal = c("loadings", "thresholds"),
group.partial = c("interpret=~stars_int_2"),
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_partial, fit.measures = TRUE, standardized = TRUE)
The group.partial argument frees the interpret =~ stars_int_2 parameter. Otherwise, all code is as per the scalar model.
lavaan 0.6-19 ended normally after 68 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 163
Number of equality constraints 66
Number of observations per group: Used Total
Non-English 3227 3236
English 3641 3649
Model Test User Model:
Standard Scaled
Test Statistic 2756.591 4193.665
Degrees of freedom 197 197
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.665
Shift parameter 46.119
simple second-order correction
Test statistic for each group:
Non-English 2378.784 2378.784
English 1814.881 1814.881
Model Test Baseline Model:
Test statistic 377816.121 138627.992
Degrees of freedom 182 182
P-value 0.000 0.000
Scaling correction factor 2.728
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.993 0.971
Tucker-Lewis Index (TLI) 0.994 0.973
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.062 0.077
90 Percent confidence interval - lower 0.059 0.075
90 Percent confidence interval - upper 0.064 0.079
P-value H_0: RMSEA <= 0.050 0.000 0.000
P-value H_0: RMSEA >= 0.080 0.000 0.006
Robust RMSEA NA
90 Percent confidence interval - lower NA
90 Percent confidence interval - upper NA
P-value H_0: Robust RMSEA <= 0.050 NA
P-value H_0: Robust RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.046 0.046
Parameter Estimates:
Parameterization Delta
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Group 1 [Non-English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.834 0.834
str__11 (.p2.) 1.048 0.009 115.900 0.000 0.874 0.874
strs__1 (.p3.) 1.021 0.009 117.633 0.000 0.851 0.851
strs__6 (.p4.) 0.968 0.010 97.699 0.000 0.807 0.807
strs__5 (.p5.) 0.770 0.014 55.328 0.000 0.642 0.642
strs__2 0.902 0.012 75.267 0.000 0.752 0.752
test =~
strs__3 1.000 0.898 0.898
strs__6 (.p8.) 0.908 0.009 100.493 0.000 0.815 0.815
strs__4 (.p9.) 0.916 0.009 99.119 0.000 0.823 0.823
strs__1 (.10.) 0.947 0.009 106.834 0.000 0.850 0.850
strs__8 (.11.) 0.766 0.012 62.206 0.000 0.688 0.688
help =~
strs__2 1.000 0.823 0.823
strs__3 (.13.) 1.066 0.013 84.701 0.000 0.877 0.877
strs__4 (.14.) 1.019 0.012 83.246 0.000 0.839 0.839
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.487 0.011 43.466 0.000 0.651 0.651
help 0.436 0.011 38.223 0.000 0.635 0.635
test ~~
help 0.441 0.012 36.615 0.000 0.597 0.597
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
st__3|1 (.15.) -0.421 0.020 -20.930 0.000 -0.421 -0.421
st__3|2 (.16.) 0.321 0.019 17.069 0.000 0.321 0.321
st__3|3 (.17.) 1.000 0.022 44.574 0.000 1.000 1.000
st__3|4 (.18.) 1.729 0.032 53.339 0.000 1.729 1.729
s__11|1 (.19.) -0.746 0.023 -32.920 0.000 -0.746 -0.746
s__11|2 (.20.) 0.034 0.019 1.778 0.075 0.034 0.034
s__11|3 (.21.) 0.700 0.021 33.620 0.000 0.700 0.700
s__11|4 (.22.) 1.447 0.028 51.325 0.000 1.447 1.447
st__1|1 (.23.) -0.703 0.022 -31.976 0.000 -0.703 -0.703
st__1|2 (.24.) 0.067 0.019 3.570 0.000 0.067 0.067
st__1|3 (.25.) 0.753 0.021 36.104 0.000 0.753 0.753
st__1|4 (.26.) 1.488 0.028 52.519 0.000 1.488 1.488
st__6|1 (.27.) -0.740 0.022 -33.263 0.000 -0.740 -0.740
st__6|2 (.28.) 0.062 0.019 3.334 0.001 0.062 0.062
st__6|3 (.29.) 0.753 0.020 37.147 0.000 0.753 0.753
st__6|4 (.30.) 1.514 0.028 53.489 0.000 1.514 1.514
st__5|1 (.31.) 0.127 0.018 7.061 0.000 0.127 0.127
st__5|2 (.32.) 0.787 0.020 38.567 0.000 0.787 0.787
st__5|3 (.33.) 1.373 0.027 50.830 0.000 1.373 1.373
st__5|4 (.34.) 1.976 0.038 51.723 0.000 1.976 1.976
st__2|1 (.35.) -0.819 0.023 -35.331 0.000 -0.819 -0.819
st__2|2 (.36.) 0.012 0.019 0.648 0.517 0.012 0.012
st__2|3 (.37.) 0.764 0.022 35.348 0.000 0.764 0.764
st__2|4 (.38.) 1.615 0.033 49.487 0.000 1.615 1.615
st__3|1 (.39.) -1.829 0.038 -48.064 0.000 -1.829 -1.829
st__3|2 (.40.) -1.079 0.025 -42.490 0.000 -1.079 -1.079
st__3|3 (.41.) -0.445 0.021 -21.710 0.000 -0.445 -0.445
st__3|4 (.42.) 0.322 0.020 16.253 0.000 0.322 0.322
st__6|1 (.43.) -1.474 0.029 -50.201 0.000 -1.474 -1.474
st__6|2 (.44.) -0.748 0.021 -35.228 0.000 -0.748 -0.748
st__6|3 (.45.) -0.188 0.019 -10.056 0.000 -0.188 -0.188
st__6|4 (.46.) 0.500 0.020 25.476 0.000 0.500 0.500
st__4|1 (.47.) -1.488 0.029 -50.691 0.000 -1.488 -1.488
st__4|2 (.48.) -0.816 0.022 -37.532 0.000 -0.816 -0.816
st__4|3 (.49.) -0.265 0.019 -13.978 0.000 -0.265 -0.265
st__4|4 (.50.) 0.456 0.020 23.206 0.000 0.456 0.456
st__1|1 (.51.) -1.206 0.026 -46.280 0.000 -1.206 -1.206
st__1|2 (.52.) -0.458 0.020 -22.911 0.000 -0.458 -0.458
st__1|3 (.53.) 0.205 0.019 10.721 0.000 0.205 0.205
st__1|4 (.54.) 0.978 0.024 41.407 0.000 0.978 0.978
st__8|1 (.55.) -0.879 0.022 -40.085 0.000 -0.879 -0.879
st__8|2 (.56.) -0.229 0.018 -12.681 0.000 -0.229 -0.229
st__8|3 (.57.) 0.333 0.018 18.408 0.000 0.333 0.333
st__8|4 (.58.) 1.018 0.023 44.542 0.000 1.018 1.018
st__2|1 (.59.) -0.699 0.023 -30.946 0.000 -0.699 -0.699
st__2|2 (.60.) 0.062 0.019 3.240 0.001 0.062 0.062
st__2|3 (.61.) 0.649 0.020 32.210 0.000 0.649 0.649
st__2|4 (.62.) 1.374 0.026 52.251 0.000 1.374 1.374
st__3|1 (.63.) -0.607 0.022 -27.189 0.000 -0.607 -0.607
st__3|2 (.64.) 0.174 0.020 8.867 0.000 0.174 0.174
st__3|3 (.65.) 0.792 0.022 36.603 0.000 0.792 0.792
st__3|4 (.66.) 1.556 0.030 51.553 0.000 1.556 1.556
st__4|1 (.67.) -0.334 0.020 -16.368 0.000 -0.334 -0.334
st__4|2 (.68.) 0.496 0.020 24.852 0.000 0.496 0.496
st__4|3 (.69.) 1.124 0.024 46.034 0.000 1.124 1.124
st__4|4 (.70.) 1.846 0.035 52.328 0.000 1.846 1.846
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.304 0.304 0.304
.stars_int_11 0.236 0.236 0.236
.stars_int_1 0.275 0.275 0.275
.stars_int_6 0.349 0.349 0.349
.stars_int_5 0.588 0.588 0.588
.stars_int_2 0.434 0.434 0.434
.stars_test_3 0.194 0.194 0.194
.stars_test_6 0.336 0.336 0.336
.stars_test_4 0.323 0.323 0.323
.stars_test_1 0.277 0.277 0.277
.stars_test_8 0.527 0.527 0.527
.stars_help_2 0.322 0.322 0.322
.stars_help_3 0.230 0.230 0.230
.stars_help_4 0.297 0.297 0.297
interpret 0.696 0.011 65.437 0.000 1.000 1.000
test 0.806 0.010 78.047 0.000 1.000 1.000
help 0.678 0.012 55.925 0.000 1.000 1.000
Group 2 [English]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret =~
strs__3 1.000 0.807 0.841
str__11 (.p2.) 1.048 0.009 115.900 0.000 0.845 0.851
strs__1 (.p3.) 1.021 0.009 117.633 0.000 0.824 0.847
strs__6 (.p4.) 0.968 0.010 97.699 0.000 0.781 0.794
strs__5 (.p5.) 0.770 0.014 55.328 0.000 0.621 0.591
strs__2 1.018 0.025 40.376 0.000 0.821 0.798
test =~
strs__3 1.000 0.850 0.872
strs__6 (.p8.) 0.908 0.009 100.493 0.000 0.772 0.841
strs__4 (.p9.) 0.916 0.009 99.119 0.000 0.779 0.837
strs__1 (.10.) 0.947 0.009 106.834 0.000 0.805 0.851
strs__8 (.11.) 0.766 0.012 62.206 0.000 0.651 0.683
help =~
strs__2 1.000 0.875 0.848
strs__3 (.13.) 1.066 0.013 84.701 0.000 0.932 0.907
strs__4 (.14.) 1.019 0.012 83.246 0.000 0.891 0.862
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret ~~
test 0.454 0.019 24.360 0.000 0.662 0.662
help 0.436 0.019 23.141 0.000 0.618 0.618
test ~~
help 0.449 0.020 22.958 0.000 0.604 0.604
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
interpret 0.254 0.021 12.186 0.000 0.315 0.315
test 0.222 0.023 9.631 0.000 0.261 0.261
help 0.347 0.023 15.377 0.000 0.396 0.396
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
st__3|1 (.15.) -0.421 0.020 -20.930 0.000 -0.421 -0.438
st__3|2 (.16.) 0.321 0.019 17.069 0.000 0.321 0.335
st__3|3 (.17.) 1.000 0.022 44.574 0.000 1.000 1.042
st__3|4 (.18.) 1.729 0.032 53.339 0.000 1.729 1.801
s__11|1 (.19.) -0.746 0.023 -32.920 0.000 -0.746 -0.751
s__11|2 (.20.) 0.034 0.019 1.778 0.075 0.034 0.034
s__11|3 (.21.) 0.700 0.021 33.620 0.000 0.700 0.705
s__11|4 (.22.) 1.447 0.028 51.325 0.000 1.447 1.457
st__1|1 (.23.) -0.703 0.022 -31.976 0.000 -0.703 -0.723
st__1|2 (.24.) 0.067 0.019 3.570 0.000 0.067 0.069
st__1|3 (.25.) 0.753 0.021 36.104 0.000 0.753 0.774
st__1|4 (.26.) 1.488 0.028 52.519 0.000 1.488 1.531
st__6|1 (.27.) -0.740 0.022 -33.263 0.000 -0.740 -0.752
st__6|2 (.28.) 0.062 0.019 3.334 0.001 0.062 0.063
st__6|3 (.29.) 0.753 0.020 37.147 0.000 0.753 0.765
st__6|4 (.30.) 1.514 0.028 53.489 0.000 1.514 1.538
st__5|1 (.31.) 0.127 0.018 7.061 0.000 0.127 0.121
st__5|2 (.32.) 0.787 0.020 38.567 0.000 0.787 0.748
st__5|3 (.33.) 1.373 0.027 50.830 0.000 1.373 1.305
st__5|4 (.34.) 1.976 0.038 51.723 0.000 1.976 1.879
st__2|1 (.35.) -0.819 0.023 -35.331 0.000 -0.819 -0.796
st__2|2 (.36.) 0.012 0.019 0.648 0.517 0.012 0.012
st__2|3 (.37.) 0.764 0.022 35.348 0.000 0.764 0.742
st__2|4 (.38.) 1.615 0.033 49.487 0.000 1.615 1.569
st__3|1 (.39.) -1.829 0.038 -48.064 0.000 -1.829 -1.877
st__3|2 (.40.) -1.079 0.025 -42.490 0.000 -1.079 -1.107
st__3|3 (.41.) -0.445 0.021 -21.710 0.000 -0.445 -0.457
st__3|4 (.42.) 0.322 0.020 16.253 0.000 0.322 0.330
st__6|1 (.43.) -1.474 0.029 -50.201 0.000 -1.474 -1.605
st__6|2 (.44.) -0.748 0.021 -35.228 0.000 -0.748 -0.815
st__6|3 (.45.) -0.188 0.019 -10.056 0.000 -0.188 -0.205
st__6|4 (.46.) 0.500 0.020 25.476 0.000 0.500 0.544
st__4|1 (.47.) -1.488 0.029 -50.691 0.000 -1.488 -1.598
st__4|2 (.48.) -0.816 0.022 -37.532 0.000 -0.816 -0.876
st__4|3 (.49.) -0.265 0.019 -13.978 0.000 -0.265 -0.285
st__4|4 (.50.) 0.456 0.020 23.206 0.000 0.456 0.490
st__1|1 (.51.) -1.206 0.026 -46.280 0.000 -1.206 -1.275
st__1|2 (.52.) -0.458 0.020 -22.911 0.000 -0.458 -0.484
st__1|3 (.53.) 0.205 0.019 10.721 0.000 0.205 0.217
st__1|4 (.54.) 0.978 0.024 41.407 0.000 0.978 1.033
st__8|1 (.55.) -0.879 0.022 -40.085 0.000 -0.879 -0.921
st__8|2 (.56.) -0.229 0.018 -12.681 0.000 -0.229 -0.240
st__8|3 (.57.) 0.333 0.018 18.408 0.000 0.333 0.349
st__8|4 (.58.) 1.018 0.023 44.542 0.000 1.018 1.067
st__2|1 (.59.) -0.699 0.023 -30.946 0.000 -0.699 -0.678
st__2|2 (.60.) 0.062 0.019 3.240 0.001 0.062 0.061
st__2|3 (.61.) 0.649 0.020 32.210 0.000 0.649 0.630
st__2|4 (.62.) 1.374 0.026 52.251 0.000 1.374 1.332
st__3|1 (.63.) -0.607 0.022 -27.189 0.000 -0.607 -0.590
st__3|2 (.64.) 0.174 0.020 8.867 0.000 0.174 0.170
st__3|3 (.65.) 0.792 0.022 36.603 0.000 0.792 0.770
st__3|4 (.66.) 1.556 0.030 51.553 0.000 1.556 1.514
st__4|1 (.67.) -0.334 0.020 -16.368 0.000 -0.334 -0.323
st__4|2 (.68.) 0.496 0.020 24.852 0.000 0.496 0.479
st__4|3 (.69.) 1.124 0.024 46.034 0.000 1.124 1.088
st__4|4 (.70.) 1.846 0.035 52.328 0.000 1.846 1.786
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.stars_int_3 0.270 0.270 0.293
.stars_int_11 0.272 0.272 0.275
.stars_int_1 0.267 0.267 0.282
.stars_int_6 0.358 0.358 0.370
.stars_int_5 0.720 0.720 0.651
.stars_int_2 0.384 0.384 0.363
.stars_test_3 0.227 0.227 0.239
.stars_test_6 0.247 0.247 0.293
.stars_test_4 0.260 0.260 0.300
.stars_test_1 0.247 0.247 0.276
.stars_test_8 0.486 0.486 0.534
.stars_help_2 0.298 0.298 0.280
.stars_help_3 0.188 0.188 0.177
.stars_help_4 0.275 0.275 0.257
interpret 0.651 0.027 24.483 0.000 1.000 1.000
test 0.723 0.030 24.033 0.000 1.000 1.000
help 0.765 0.034 22.702 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
stars_int_3 1.042 0.019 54.099 0.000 1.042 1.000
stars_int_11 1.007 0.019 52.644 0.000 1.007 1.000
stars_int_1 1.029 0.019 53.536 0.000 1.029 1.000
stars_int_6 1.016 0.019 54.531 0.000 1.016 1.000
stars_int_5 0.951 0.019 49.253 0.000 0.951 1.000
stars_int_2 0.972 0.021 45.785 0.000 0.972 1.000
stars_test_3 1.026 0.020 52.185 0.000 1.026 1.000
stars_test_6 1.089 0.020 54.744 0.000 1.089 1.000
stars_test_4 1.074 0.020 54.477 0.000 1.074 1.000
stars_test_1 1.057 0.020 53.761 0.000 1.057 1.000
stars_test_8 1.048 0.021 50.481 0.000 1.048 1.000
stars_help_2 0.970 0.019 50.421 0.000 0.970 1.000
stars_help_3 0.973 0.020 47.804 0.000 0.973 1.000
stars_help_4 0.967 0.019 49.755 0.000 0.967 1.000
Interpretation
After freeing the loading for interpret =~ stars_int_2:
Model fit improves slightly (standard χ² decreases from 2778.65 → 2756.59; df decreases by 1)
Fit indices remain essentially unchanged (scaled CFI ≈ .971, scaled RMSEA ≈ .077, SRMR ≈ .046)
The model still shows acceptable fit overall
Crucially:
Only one parameter was freed
The rest of the measurement model remains constrained and stable
Conclusion: Comparisons across groups are still valid. The scale is functioning similarly across groups, with only a minor deviation for one item.
This is a textbook example of partial invariance working as intended.
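If you wanted to confirm that formally, you could compare the scalar and partial models the same way we compared models earlier; a minimal sketch:
lavaan::anova(fit_scalar, fit_partial)
lavaan::fitMeasures(fit_partial, c("cfi", "tli", "rmsea", "srmr"))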
Should we remove the item instead?
In this case, no — and this is exactly the situation where people are tempted to overreact.
Freeing the parameter is sufficient because:
The non-invariance is localised (just one loading)
The item still has strong loadings in both groups (~.75 vs ~.80 standardized)
Overall model fit is already good
Removing the item could:
reduce content coverage
break comparability with previous STARS work
solve a problem that is already adequately handled
When might item removal be justified?
Only consider removing an item if you see a pattern, not a blip:
The same item is repeatedly non-invariant across models or samples
It has weak loadings or behaves inconsistently
It lacks clear theoretical justification
That is not what’s happening here.
Final Interpretation
Although several individual constraints show statistically significant violations, these are:
relatively small in number
modest in magnitude (given sample size)
not clearly concentrated within a specific part of the model
Combined with excellent overall model fit, this suggests:
Any non-invariance is minor and does not meaningfully affect comparability across groups
Therefore:
Scalar invariance can be retained for practical purposes, without introducing partial invariance.
Key Takeaway
Score tests detect any deviation from perfect equality, but invariance decisions should be based on whether those deviations matter in practice.
Report
You should report the following:
1. Model Specification
In-text: Describe the factor structure and make clear that it is being tested across groups.
What is the model?
What groups are being compared?
2. Groups
In-text: State the grouping variable and group sizes.
e.g., gender, language, country
Include N per group where possible
3. Estimator
In-text: Same as CFA — one or two sentences.
e.g., DWLS/WLSMV for ordinal data
4. Invariance Models
In-text: State the sequence of models tested:
- Configural
- Metric
- Scalar
- (Strict, if used)
5. Fit Indices
In-text: Summarise fit for each model briefly.
- Do not write a full paragraph per model
- Focus on whether fit is acceptable and how it changes
Table: Key fit indices (and change between models), for example:
| Model | CFI | TLI | RMSEA | SRMR | ΔCFI |
|---|---|---|---|---|---|
| Configural | … | … | … | … | — |
| Metric | … | … | … | … | … |
| Scalar | … | … | … | … | … |
6. Changes in Fit
In-text: Explicitly report ΔCFI (and, optionally, ΔRMSEA) between models.
This is how invariance is judged
Absolute fit matters less here
7. Decisions at Each Stage
In-text: After each step, clearly state:
Whether invariance is supported
Based on what criterion
8. Modifications / Partial Invariance
In-text: Only include if relevant.
What parameter was freed (e.g., thresholds/intercepts)
Why (e.g., score tests)
What happened after
Table: If you free/remove multiple parameters, you could include a table.
9. Final Interpretation
In-text: State clearly:
highest level of invariance achieved
what that allows you to do, e.g.:
Metric → compare relationships
Scalar → compare latent means
🤔 How would you write up the results of this MI analysis in-text and in a table?
A multi-group confirmatory factor analysis (CFA) was conducted to test measurement invariance of the STARS across language groups (English vs. non-English). The hypothesised three-factor model, comprising Interpretation Anxiety, Test Anxiety, and Help-Seeking, was evaluated across groups. The sample consisted of 3649 English-speaking participants and 3236 non-English-speaking participants. The model was estimated using the DWLS estimator, which is appropriate for ordinal data.
Measurement invariance was assessed sequentially by fitting configural, metric, and scalar models. As shown in Table 1, overall model fit was acceptable and remained highly stable across all stages. The configural model demonstrated good fit (CFI = .994, RMSEA = .065), indicating that the same factor structure was plausible across groups. Constraining factor loadings to equality in the metric model did not meaningfully change model fit (CFI = .994, RMSEA = .064), and further constraining thresholds in the scalar model likewise resulted in negligible changes (CFI = .993, RMSEA = .062).
Evaluation of changes in fit indices indicated no meaningful deterioration in model fit at any stage. The change in CFI was zero for the configural-to-metric comparison and .001 for the metric-to-scalar comparison, both well within recommended thresholds for supporting invariance. Although chi-square difference tests were statistically significant (see Table 1), this is expected given the large sample size and does not indicate substantive misfit.
Taken together, these findings support configural, metric, and scalar invariance of the STARS across language groups. This indicates that the factor structure, factor loadings, and response thresholds can be considered equivalent across groups. Consequently, observed differences in scores can be interpreted as reflecting true differences in the underlying constructs, and comparisons of latent means across language groups are justified.
Table 1
Summary of model fit and changes in fit across configural, metric, and scalar invariance models.
| Model | CFI | TLI | RMSEA | SRMR | ΔCFI |
|---|---|---|---|---|---|
| Configural | .994 | .993 | .065 | .045 | — |
| Metric | .994 | .993 | .064 | .046 | .000 |
| Scalar | .993 | .994 | .062 | .046 | .001 |
9.6 Part 3: Worksheet
Code Recognition Tasks
Here’s all the code from this tutorial. Can you remember what each line of each code chunk does? Are there any code chunks that you struggle to make sense of (you would be forgiven for skipping over some of the detail in the formatted loadings table!)? Make sure to revisit the section in which it is used and take notes.
stars_data <- readr::read_csv(here::here("data/stars_mi_data.csv"))
stars_data |> dplyr::count(language)
stars_data |>
dplyr::count(language) |>
dplyr::summarise(
ratio = n[language == "English"] / n[language == "Non-English"]
)
stars_model <- '
interpret =~ stars_int_3 + stars_int_11 + stars_int_1 +
stars_int_6 + stars_int_5 + stars_int_2
test =~ stars_test_3 + stars_test_6 + stars_test_4 +
stars_test_1 + stars_test_8
help =~ stars_help_2 + stars_help_3 + stars_help_4
'
item_vars <- c(
"stars_int_3", "stars_int_11", "stars_int_1",
"stars_int_6", "stars_int_5", "stars_int_2",
"stars_test_3", "stars_test_6", "stars_test_4",
"stars_test_1", "stars_test_8",
"stars_help_2", "stars_help_3", "stars_help_4"
)
groups <- unique(stars_data$language)
fits <- purrr::map(groups, ~ {
stars_grouped <- stars_data |>
dplyr::filter(language == .x) |>
dplyr::select(dplyr::all_of(item_vars))
lavaan::cfa(
model = stars_model,
data = stars_grouped,
estimator = "WLSMV",
ordered = item_vars
)
}) |>
rlang::set_names(groups)
fit_summary <- fits |>
purrr::map_dfr(
~ lavaan::fitMeasures(.x,
c("cfi", "tli", "rmsea", "srmr")
),
.id = "group"
)
fit_summary
loadings <- fits |>
purrr::map_dfr(
~ lavaan::parameterEstimates(.x, standardized = TRUE) |>
dplyr::filter(op == "=~") |>
dplyr::select(lhs, rhs, est, std.all),
.id = "group"
)
loadings
issues <- fits |>
purrr::map_dfr(
~ lavaan::parameterEstimates(.x) |>
dplyr::filter(
est < 0 & op == "~~" |
se > 10 |
abs(est) > 10
),
.id = "group"
)
issues
stars_data <- stars_data |>
dplyr::mutate(language = forcats::as_factor(language))
fit_configural <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_configural, fit.measures = TRUE, standardized = TRUE)
fit_metric <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
group.equal = "loadings",
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_metric, fit.measures = TRUE, standardized = TRUE)
fit_scalar <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
group.equal = c("loadings", "thresholds"),
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_scalar, fit.measures = TRUE, standardized = TRUE)
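# Compare nested models: chi-square difference tests, then fit indices per model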
lavaan::anova(fit_configural, fit_metric, fit_scalar)
lavaan::fitMeasures(fit_configural, c("cfi", "tli", "rmsea", "srmr"))
lavaan::fitMeasures(fit_metric, c("cfi", "tli", "rmsea", "srmr"))
lavaan::fitMeasures(fit_scalar, c("cfi", "tli", "rmsea", "srmr"))
### OPTIONAL SECTION ON PARTIAL INVARIANCE ####
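# Score test: which equality constraints contribute most to misfit?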
lavaan::lavTestScore(fit_scalar)
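# Pull the full parameter table and look up the flagged constraint label (.p6.)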
params <- lavaan::parameterEstimates(fit_scalar)
params |>
dplyr::select(lhs, op, rhs, label) |>
dplyr::filter(label == ".p6.")
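# Partial invariance model: free the flagged loading across groups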
fit_partial <- lavaan::cfa(
model = stars_model,
data = stars_data,
group = "language",
group.equal = c("loadings", "thresholds"),
group.partial = c("interpret=~stars_int_2"),
estimator = "WLSMV",
ordered = item_vars
)
lavaan::summary(fit_partial, fit.measures = TRUE, standardized = TRUE)
Worksheet
Described below is a simulated dataset for an imaginary scale. Have a read through, complete the coding tasks in Posit Cloud, and then see if you can correctly answer the worksheet questions below.
Digital Life Balance Scale (DLBS) Data
The dlbs_mi_data.csv dataset contains (simulated) responses from 400 participants to a 20-item Digital Life Balance Scale (DLBS).
Each item uses a 1–5 Likert scale (1 = strongly disagree, 5 = strongly agree). There are no reverse-scored items.
The scale is designed to capture different aspects of people’s experiences with their digital lives.
Our EFA revealed the following structure:
[Figure: factor structure of the DLBS suggested by the EFA]
The retained items and their corresponding factors are below for reference (note that dlbs_20 has been removed):
| Item | Factor | Wording |
|---|---|---|
| dlbs_9 | Online Social Pressure | I feel left out when I see others posting about events. |
| dlbs_10 | Online Social Pressure | I feel judged based on my online presence. |
| dlbs_6 | Online Social Pressure | I feel pressure to present myself positively online. |
| dlbs_7 | Online Social Pressure | I compare my life to others on social media. |
| dlbs_8 | Online Social Pressure | I worry about how many likes or reactions I receive. |
| dlbs_1 | Digital Overload | I feel overwhelmed by the number of notifications I receive. |
| dlbs_5 | Digital Overload | My online activity leaves me feeling exhausted. |
| dlbs_4 | Digital Overload | I find it hard to switch off from digital devices. |
| dlbs_2 | Digital Overload | I struggle to keep up with messages across different platforms. |
| dlbs_3 | Digital Overload | I often feel mentally drained after spending time online. |
| dlbs_14 | Digital Self-Control | I check my phone without thinking about it. |
| dlbs_15 | Digital Self-Control | I feel in control of my online habits. |
| dlbs_11 | Digital Self-Control | I can easily limit my screen time when I need to. |
| dlbs_13 | Digital Self-Control | I lose track of time when using apps. |
| dlbs_12 | Digital Self-Control | I set boundaries around when I use digital devices. |
| dlbs_16 | Online Engagement & Enjoyment | I enjoy interacting with others online. |
| dlbs_18 | Online Engagement & Enjoyment | Being online helps me feel connected to others. |
| dlbs_19 | Online Engagement & Enjoyment | I learn useful things through digital platforms. |
| dlbs_17 | Online Engagement & Enjoyment | I find digital spaces inspiring. |
This week, your task is to test measurement invariance across two groups defined by neurotype: neurodivergent and neurotypical.
Coding Tasks
1. Read in the dlbs_mi_data.csv dataset
2. Check group sizes and ratio (don’t worry about splitting the sample to check individual per-group CFAs); a starter sketch for these first two tasks follows this list
3. Specify the CFA measurement model
4. Fit the configural model (you’ll need to specify DLBS-specific item_vars first)
5. Fit the metric model
6. Fit the scalar model
7. Compare fit
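To get you started, here is a minimal sketch of the first two tasks. The data path and the grouping variable name (neurotype, with lowercase group labels) are assumptions based on the description above; adjust them to match your copy of the data in Posit Cloud.
# Read in the DLBS data (path is an assumption; adjust if yours differs)
dlbs_data <- readr::read_csv(here::here("data/dlbs_mi_data.csv"))
# Check group sizes and their ratio
# (the variable name "neurotype" and its labels are assumptions)
dlbs_data |> dplyr::count(neurotype)
dlbs_data |>
  dplyr::count(neurotype) |>
  dplyr::summarise(
    ratio = n[neurotype == "neurodivergent"] / n[neurotype == "neurotypical"]
  )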
Questions
There are no interactive questions this week: they’re nice, but a bit glitchy, and open questions are better suited to this kind of analysis.
Instead, write your answers in the Quarto document and check your answers in the accompanying tutorial solutions file in Posit Cloud.
Are group sizes large enough and roughly equal?
Does the model fit reasonably well in both groups?
Are the relationships between items and their underlying factors equivalent across groups?
Do group differences in observed scores reflect true differences in the underlying constructs?
Does adding equality constraints lead to a meaningful deterioration in model fit?
Report the MI results in-text and add relevant tables.
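For the write-up, a fit-comparison table like Table 1 is usually expected. Below is one way to build it, mirroring the earlier ΔCFI sketch; the model object names dlbs_configural, dlbs_metric, and dlbs_scalar are assumptions based on whatever you called your own models, and knitr::kable() is used so the table renders in your Quarto document.
# Build a reporting table of fit indices and ΔCFI for the DLBS models
# (model object names are assumptions; use the names from your own code)
list(
  Configural = dlbs_configural,
  Metric     = dlbs_metric,
  Scalar     = dlbs_scalar
) |>
  purrr::map_dfr(
    ~ lavaan::fitMeasures(.x, c("cfi", "tli", "rmsea", "srmr")),
    .id = "Model"
  ) |>
  dplyr::mutate(`ΔCFI` = cfi - dplyr::lag(cfi)) |>
  knitr::kable(digits = 3)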