Modern Psychometric Analysis of the Muscle Strengthening Activity Scale (MSAS) Using Item Response Theory

Background: With the growing need to promote muscle strengthening activity (MSA) for improved health-related quality of life (HRQOL) comes the growing need for proper measurement of MSA behavior. The purpose of this study was to examine test and item functioning of the MSA scale (MSAS) using item response theory (IRT). Methods: The current research fit data from a sample of N = 400 respondents to two different graded response models (GRMs): a three-item muscular strength scale and a three-item muscular endurance scale. For each GRM, model-data fit was examined and IRT assumptions were assessed. Results: An unconstrained GRM was found to fit the data better than the constrained model (ΔG² Strength = 10.3, p = .006, RMSEA = .043; ΔG² Endurance = 7.0, p = .031, RMSEA = .021). GRM boundary location parameters covered the latent trait scale well for both the strength (bs: -4.26 to 2.58) and endurance (bs: -3.86 to 1.79) scales, with each item showing adequate fit to the data (all RMSEAs < .05). Test information was approximately evenly distributed around a theta of zero, with summed information across the theta range of ±4 equal to 92.8% (strength) and 93.5% (endurance) of the total. Only 2.3% and 1.5% of persons misfit the strength and endurance GRMs, respectively. Conclusion: The MSAS has been shown to be a valid tool for measuring MSA behavior in adults using modern psychometric theory.


Introduction
The current physical activity (PA) recommendations make it clear: adults should participate regularly in muscle strengthening activity (MSA) in order to gain the many associated health benefits, including improved health-related quality of life (HRQOL) [1,2,3]. And like any health behavior, the ability to measure MSA has vast implications for researchers, practitioners, clinicians, and educators. The muscle strengthening activity scale (MSAS) is a self-report assessment tool designed to measure MSA behavior in adults and has exhibited promising psychometric properties [4,5]. However, thus far, evidence supporting the MSAS has been limited to classical test theory (CTT) methods. CTT is the most prevalent and conventional model used by researchers to validate behavioral scales [6]. The focus of CTT is placed on the unweighted sum of responses across items of an instrument, otherwise known as the observed score (X). The CTT model states that an individual's observed score is a function of their true score (T) and random error (E) [7], where the true score is the mean of an infinite number of independent observed scores from an infinite number of independent test administrations. Therefore, any given single observed score will differ from its mean due to error. Symbolically, the true score model for individual i is

Xi = Ti + Ei

There are many flaws in CTT, however, that motivate researchers to search for other means of assessing the psychometric properties of a scale [8]. Firstly, CTT is not a testable model and provides no clear methods for assessing model-data fit. Secondly, under CTT, an observed score is influenced by the characteristics (e.g., difficulty) of the test (i.e., an easier test will result in higher observed and true scores). Lastly, conventional reliability coefficients are affected by characteristics of the individuals (i.e., observed scores with more variability will mathematically inflate the reliability of a test).
This can be seen by viewing the symbolic form of reliability under CTT:

Reliability = σ²T / σ²X = (σ²X - σ²E) / σ²X

where true score variance (the numerator) is estimated from the difference between the observed score and error variances.

Research in Psychology and Behavioral Sciences
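The last limitation noted above can be illustrated with a short sketch. The variance values below are hypothetical, chosen only to show how greater observed-score variability inflates CTT reliability even when error variance is unchanged:

```python
def ctt_reliability(var_observed, var_error):
    """CTT reliability: true-score variance (observed-score variance
    minus error variance) divided by observed-score variance."""
    return (var_observed - var_error) / var_observed

# Identical measurement error, but a more variable sample of persons:
low_var = ctt_reliability(25.0, 10.0)   # -> 0.6
high_var = ctt_reliability(50.0, 10.0)  # -> 0.8
```

The error variance is the same in both cases; only the spread of observed scores differs, yet the reliability coefficient rises from .60 to .80.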
A more modern approach to scale development and validation, which can complement CTT-based research, is item response theory (IRT). IRT provides a system of mathematical equations that can model the relationship between latent traits (e.g., ability) and observed responses to items [9]. IRT models can then assess the functioning of each item to determine how well it performs in measuring the trait of interest. IRT can also provide a measure of reliability similar to CTT, but with the added advantage of measuring how that precision varies across the latent trait [10]. Another large benefit of IRT is that item difficulty is estimated on an interval scale, the same scale as person ability, called theta (θ) [11]. Finally, IRT has strengths where CTT has limitations. That is, with IRT: 1) models can be tested for appropriate fit to the data, 2) item parameters are invariant to changes in persons (i.e., regardless of sample population), and 3) person parameters are invariant to changes in the test (i.e., easy vs. difficult tests) [12].
In summary, the ability to properly measure MSA is of increasing interest to researchers and related professionals concerned with the associations between PA and health outcomes in adults. IRT is a modern psychometric approach to validating self-report behavioral scales by examining how well each item in the instrument functions. Therefore, the purpose of this study was to employ IRT to examine the functioning of the MSAS. Specifically, the graded response model (GRM) was used to examine the appropriateness of each item of the MSAS in measuring MSA behavior in adults.

Study and Scale Development Procedures
The development procedures related to the MSAS have been explained in detail elsewhere [4,5]. Briefly, a total of N=400 adults who indicated participating in regular MSA provided responses to the MSAS. After an item analysis, the initial version of the MSAS resulted in a final seven-item scale measuring three distinct MSA constructs: a three-item muscular strength construct, a three-item muscular endurance construct, and a single-item body weight exercise construct. The final version of the MSAS is enclosed in the appendix. Item stems for the three MSA scales consist of personalized statements regarding muscular strength training behavior, muscular endurance training behavior, and body weight exercise training behavior. For example, "I often exercise my muscles with heavy weight that I can lift 1 to 8 times". Each response scale contains the same five-category options ranging from "Never true" to "Always true". Two additional items are included in the MSAS that ask participants about their frequency and duration of MSA participation. These participation items are included to quantify amounts of MSA performed but are not evaluated in this study. Directions are given at the bottom of the MSAS for obtaining strength, endurance, and body weight exercise scores as well as an MSA participation score.

Graded Response Model (GRM)
There are a number of different IRT models available for polytomous response items. Such options include the Rasch rating scale model (RSM) [13], the Rasch partial credit model (PCM) [14], generalized versions of the RSM [15] and PCM [16], and the nominal response model (NRM) [17]. This study, however, used the graded response model (GRM) [18,19]. The GRM is a generalization of the two-parameter logistic item response model (2PLM) for polytomous response items. The GRM was used over other IRT models because 1) the MSAS has the same ordinal-level response options across its items, 2) both item difficulty and item discrimination parameters are estimated in the GRM, and 3) the GRM allows the researcher to constrain (set equal) the item discrimination parameters and perform a nested model test to examine the statistical usefulness of freely estimating those parameters, hence providing justification (or lack thereof) for separate item discrimination parameter estimates. As mentioned above, item discrimination was of interest in this study because of its ability to identify how strongly each item is associated with the MSAS latent traits [20]. Additionally, GRM item difficulty values identify the level of the latent trait at which the respondent has a 50% chance of endorsing the current or higher response categories versus all lower categories [21]. Evaluating the extent to which location parameters within and between items cover the latent trait range is essential in determining how well the scale items function. In combination, both item parameters were sought in this study to properly assess MSAS scale functioning. The cumulative boundary response function (BRF) of the GRM is defined as

P*Xj(θ) = exp[αj(θ - δXj)] / (1 + exp[αj(θ - δXj)])

where θ (theta) is the latent trait, αj is the discrimination parameter for item j, and δXj is the category boundary location (difficulty) for category Xj, with k categories and k - 1 category boundary locations for each item [22].
Plotting each cumulative BRF with respect to theta yields an item's boundary characteristic curve (BCC) graph. In brief, the above equation specifies the probability of obtaining a category score of Xj or higher on item j. Therefore, to compute the probability of responding in a particular category k (Pk), the difference between the cumulative probabilities for adjacent categories must be found. This is specifically shown as

Pk = P*k - P*k+1

where P*k and P*k+1 are from the BRF above. Plotting Pk across theta for each category yields an item's category characteristic curve (CCC) graph.
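The BRF and category-probability equations above can be sketched numerically. The following is an illustrative Python translation (logistic metric, scaling factor 1.0, with made-up item parameters), not the estimation code used in the study:

```python
import numpy as np

def boundary_probs(theta, a, b):
    """Cumulative boundary response functions P*_k(theta): the
    probability of responding in category k or higher. a is the item
    discrimination; b holds the k - 1 ordered boundary locations."""
    return 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, float))))

def category_probs(theta, a, b):
    """Category probabilities P_k = P*_k - P*_(k+1), padding the
    cumulative probabilities with 1 (lowest) and 0 (above highest)."""
    p_star = np.concatenate(([1.0], boundary_probs(theta, a, b), [0.0]))
    return -np.diff(p_star)  # differences of adjacent cumulative probs

# Hypothetical five-category item (four boundary locations), theta = 0:
probs = category_probs(0.0, 2.0, [-2.0, -1.0, 0.0, 1.0])
# the five category probabilities sum to 1.0
```

Evaluating `category_probs` over a grid of theta values and plotting each column would trace out the CCC graph described above.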

Statistical Analyses
The following procedures were the same for, and performed separately on, the strength and endurance scales of the MSAS. The IRT analyses were divided into three categories: descriptive statistics, scale calibration and assessment, and IRT assumption checking. For the descriptive statistics, item category response rates were reported, CTT reliability coefficients (i.e., Cronbach's alpha) were computed, and cumulative frequency histograms (i.e., Pareto charts) were constructed for scale sum scores. For scale calibration and assessment, a series of six steps was followed. First, two competing GRMs were fit to the data: one with item discrimination parameters constrained (set equal) across items and one with them freely estimated. The GRMs were of the homogeneous class (item discrimination the same across category options) with parameters set to the logistic metric (scaling factor of 1.0). For both models, Akaike's information criterion (AIC), the sample size adjusted AIC, the Bayesian information criterion (BIC), the sample size adjusted BIC (SABIC), and the root mean square error of approximation (RMSEA) were computed as measures of fit. Additionally, a likelihood ratio test between the two IRT models was conducted to determine whether the estimation of the extra parameters is statistically warranted. Models were fit using marginal maximum likelihood estimation (MMLE) with the Gauss-Hermite quadrature rule [23]. Second, parameter estimates were reported for the better fitting GRM, including item discrimination, category boundary location (difficulty), person ability, and their standard errors. Third, an item characteristic curve (ICC) graph was generated for each item to examine the probability of selecting an item category across the latent trait scale. When scale data are polytomous, an ICC technically becomes a BCC and is hereafter referred to as such. Each BCC was evaluated to ensure item responses were in accord with the latent trait (theta).
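The nested-model likelihood ratio test in the first step can be sketched as follows. For a three-item scale, constraining three discriminations to one common value removes two parameters, and for a 2-df test the chi-square survival function has the closed form exp(-G²/2). The log-likelihood values below are hypothetical, chosen only for illustration; this is not the study's code:

```python
import math

def lrt_df2(ll_constrained, ll_unconstrained):
    """Likelihood ratio test between nested IRT models differing by
    two parameters: G2 = 2 * (LL_unconstrained - LL_constrained), and
    for df = 2 the chi-square p-value is exp(-G2 / 2)."""
    g2 = 2.0 * (ll_unconstrained - ll_constrained)
    return g2, math.exp(-g2 / 2.0)

# Hypothetical maximized log-likelihoods differing by 5.15 units:
g2, p = lrt_df2(-1000.0, -994.85)  # g2 = 10.3, p ~ .006
```

A small p-value indicates that freely estimating the item discriminations is statistically warranted, which is the decision rule applied between the constrained and unconstrained GRMs here.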
Fourth, an item response category characteristic curve (CCC) graph was generated for each item to examine the latent trait values at which the probability of selecting an item category or higher is 50%. Each CCC was evaluated for proper item functioning across the latent trait (theta). Fifth, test information (I) was computed across specific areas of the MSAS latent trait. Information tells us how certain we are about a person's location (theta) on the latent trait continuum, and it has a reciprocal relationship with the standard error of estimate (SEE). More specifically, test information provides a way to quantify how well a scale discriminates across the latent trait. Based on the test information functions provided by the fitted GRMs, marginal reliability (MR) was computed and information graphs were constructed. Test information was inspected to ensure consistent and wide coverage across the latent trait continuum. Sixth, summary statistics were computed and graphs constructed for the MSAS construct person (theta) values.
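The reciprocal relationship between information and the SEE can be sketched for a single hypothetical GRM item. The item parameters below are assumptions for illustration, and the category-probability derivative is taken numerically rather than analytically; this is not the study's code:

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """GRM category probabilities: adjacent differences of the
    cumulative boundary probabilities, padded with 1 and 0."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, float))))
    return -np.diff(np.concatenate(([1.0], p_star, [0.0])))

def item_information(theta, a, b, h=1e-5):
    """Polytomous item information I(theta) = sum_k P_k'(theta)^2 / P_k,
    with P_k' approximated by a central finite difference. The standard
    error of estimate is then SEE(theta) = 1 / sqrt(I(theta))."""
    p = grm_category_probs(theta, a, b)
    dp = (grm_category_probs(theta + h, a, b)
          - grm_category_probs(theta - h, a, b)) / (2.0 * h)
    return float(np.sum(dp ** 2 / p))

# Hypothetical item: precision peaks near the boundary locations
# and falls off at extreme theta values.
info_center = item_information(0.0, 2.0, [-1.0, 0.0, 1.0])
see_center = 1.0 / info_center ** 0.5
```

Summing such item information values over a scale's items gives the test information function, and inspecting where it concentrates on theta is exactly the fifth step described above.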
For IRT assumption checking, three main assumptions were assessed: local independence, unidimensionality, and model fit. Local independence refers to responses to an item being independent of responses to any other item after controlling for person location (theta). This assumption was assessed using the local dependence (LD) chi-square statistic, standardized residuals, and Cramer's V coefficients [24]. Standardized residuals greater than 10.0 in absolute value and Cramer's V values of 0.40 or larger were considered problematic [25,26]. Unidimensionality refers to the notion that responses to items are solely a function of a single latent trait. This assumption was assessed using Velicer's minimum average partial (MAP) test, in which a series of principal components is partialed out of the item correlation matrix to yield a series of partial correlation matrices [27,28]. The step that results in the lowest average squared (or 4th power) partial correlation determines the number of components to retain. Finally, model fit was assessed by examining model and item RMSEA statistics, where values ≤ .05 indicate adequate fit [29]. Additionally, a standardized statistic (Zh) was computed for each person, where values greater than 2.0 in absolute value were considered misfit to the GRM. Negative values of Zh reflect person responses that are inconsistent (unlikely) given the GRM, and positive values of Zh reflect person responses that are more consistent than the GRM predicts [30]. The percentage of persons misfitting the model, using the Zh statistic, was used as a measure of model fit. All IRT analyses were conducted using the R ltm and mirt packages [31,32,33,34].

Results

Table 1 contains item category endorsement distributions for the N=400 MSAS respondents. Every response category received endorsements. Although six categories saw an endorsement rate of less than ten percent, this might be expected from a relatively small sample size.
Additionally, the reliability estimates shown (α Strength = .62 and α Endurance = .64) are acceptable for scales of this size. That is, using the Spearman-Brown prophecy formula, if each MSAS scale were doubled to six items, the new scale reliability estimates would increase to .77 and .78 for strength and endurance, respectively. Figure 1 and Figure 2 show the cumulative frequency distributions of the MSAS strength and endurance scale sum scores. Both graphs indicate sparse ceiling and floor sum scores from respondents. Table 2 contains model fit statistics for both the constrained and unconstrained GRMs. Although both models fit well (RMSEAs < .05), the likelihood ratio test indicates that the unconstrained GRM fits the sample data better (ΔG² Strength = 10.3, p = .006; ΔG² Endurance = 7.0, p = .031). Therefore, all results hereafter summarize the unconstrained GRM. Table 3 contains GRM item parameter estimates for the MSAS. Item 1 displays the greatest discrimination in the strength scale (a = 2.77), whereas item 5 displays the greatest discrimination in the endurance scale (a = 2.29). These findings are consistent with the item-to-theta correlation (rTheta) values, which also serve as a measure of item discrimination. The GRM boundary location parameters appear to cover the latent trait well for both the strength (bs: -4.26 to 2.58) and endurance (bs: -3.86 to 1.79) scales, with each item showing adequate fit to the data (all RMSEAs < .05). Inspection of item category coverage across theta is aided by visual inspection of the CCCs in Figure 3. Each item category has its own area on the theta scale where its probability of endorsement is greater than that of any other category, indicating adequate item functioning. Figure 3 also visually indicates the same boundary location parameters displayed in Table 3, as the intersections of adjacent CCCs.
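The Spearman-Brown projection reported above can be reproduced in a few lines (illustrative only; the inputs are the alpha values and length factor from the text):

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown prophecy formula: projected reliability when
    test length is multiplied by length_factor."""
    k, r = length_factor, reliability
    return k * r / (1.0 + (k - 1.0) * r)

# Doubling each three-item MSAS scale to six items:
strength = round(spearman_brown(0.62, 2), 2)   # -> 0.77
endurance = round(spearman_brown(0.64, 2), 2)  # -> 0.78
```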
However, visual inspection of these boundary location parameters is aided further by viewing the BCCs in Figure 4. Two main characteristics are noteworthy in Figure 4. One, the discrimination parameters are visually apparent: each category within an item shares the same discrimination (slope), indicative of the homogeneous class of fitted models, and the greatest slopes are again seen for item 1 and item 5 of the strength and endurance scales, respectively. Two, the boundary locations (theta values where the probability of endorsing the category or higher is 0.50) are adequately spread across the latent trait scale when considering all scale items together. Table 4 contains test information values for each MSAS scale. Two comments regarding these values are noteworthy. One, a large percentage of the total information is collected within the theta range of -4 to +4 by the strength (92.8%) and endurance (93.5%) scales. Two, this information appears roughly equal for higher and lower levels of the MSAS traits. The marginal reliability (MR) values capture the average information across the theta scale, with acceptable values (given the three-item scales) of .72 and .69 for the strength and endurance scales, respectively. Figure 5 and Figure 6 display the test information functions along with the standard error of estimate lines. These graphs reinforce the information in Table 4 with the added detail of showing where on the theta scale the MSAS information is greatest: for both MSAS scales, measurement precision is greatest around a theta value of zero. Unlike CTT measures of reliability, IRT test information has the added advantage of indicating where on the measured trait reliability is strong and where improvement is needed. For example, the MSAS may require more difficult items if we wish to measure extreme MSA behavior with a high level of precision.

Note (Table 2). GRM is the graded response model; GRMC is the GRM with discrimination parameters constrained. G² is the likelihood ratio test statistic; ΔG² is the likelihood ratio test for nested IRT models. RMSEA is the root mean square error of approximation. AIC is the Akaike information criterion; AICc is the sample size adjusted AIC. BIC is the Bayesian information criterion; SABIC is the sample size adjusted BIC. LL is the negative log-likelihood statistic; df is model degrees of freedom.

Note (Table 3). a is the GRM discrimination parameter; b is the GRM boundary location (difficulty) parameter; SE is the standard error. rTheta is the Pearson correlation coefficient between each item response score and the IRT theta estimate. RMSEA is the root mean square error of approximation.

Note (Table 4). I is test information (%); MR is the marginal reliability coefficient of theta.

Note. Zh is the standardized statistic for person fit to theta; rscore is the Pearson correlation coefficient between scale sum scores and the GRM theta estimate.

Discussion
The results from this modern psychometric analysis clearly support the valid measurement of muscular strength and muscular endurance behavior using the MSAS. These results are backed by acceptable model fit, acceptable item fit, respectable item category functioning across theta, adequate distribution of test information, and the satisfaction of IRT model assumptions. Results from this study also complement previous research utilizing CTT methods and reinforce the strong psychometric evidence validating the MSAS [4,5]. However, the current research does have some noteworthy limitations. First, although this study demonstrated wide item coverage across theta, that coverage did not extend to extreme levels of theta with high precision. Therefore, the MSAS may not provide precise estimates of extreme MSA behavior. Further research is warranted and should include more extreme MSA participants in the sample to determine whether the information collected increases at extreme theta values. Second, within each MSAS scale, a single item emerged as the dominant discriminating item, meaning each scale includes two items with relatively lower item discrimination, a fact that also relates to the lower test information. Despite this weaker discrimination, all items did fit the GRM. Moreover, the items displaying relatively lower discrimination also showed a response category with low endorsement (< 10%), so this weakness may be due to the relatively small sample size. Further research on the MSAS is suggested with larger samples (Ns > 1,000), where category endorsements, item discrimination, and test information may improve. Other research is also suggested to add to the body of psychometric evidence for the MSAS. Firstly, Rasch measurement is another set of IRT models indicated for polytomous response scales [35].
A Rasch measurement analysis of the MSAS could add to this body of evidence philosophically by assessing the extent to which the hypothesis that muscular strength and endurance are each quantitative, measurable traits holds. A Rasch measurement analysis could add to this body of evidence statistically by fixing the discrimination parameter equal across all items and assessing whether the category difficulty parameters should be held constant across items (RSM) or freely estimated across items (PCM). Furthermore, by testing which polytomous Rasch model fits better (i.e., RSM vs. PCM), one can assess the extent to which the MSAS should be considered an interval-level (RSM) or ordinal-level (PCM) scale [36]. Secondly, an experimental study of the MSAS is suggested to further evaluate its ability to separate MSA participants with known trait differences. Lastly, an additional study is necessary to evaluate part II of the MSAS for its ability to quantify participation in MSA.

Conclusions
The MSAS is a seven-item self-report instrument that measures three MSA constructs: a three-item muscular strength construct, a three-item muscular endurance construct, and a single-item body weight exercise construct. The results from this study provide modern psychometric evidence supporting the valid measurement of muscular strength and muscular endurance behavior using the MSAS. Two additional items (nine items total) are included in the MSAS to quantify MSA participation and await future validation. The MSAS is free to use without restriction, provided proper citation is given.