Research Article

Open Access Peer-reviewed

Simon Ntumi, Sheilla Agbenyo, Tapela Bulala

Received July 24, 2021; Revised November 11, 2021; Accepted February 23, 2022

**Background**: In classical test theory (CTT), alternate test forms are needed so that a test can be given to different groups or on different testing occasions. This assumption motivated the researchers to construct alternate test forms and estimate their parameters (µ, σ^{2} and Cσ^{2}). **Methods**: To obtain the parameter estimates, three (3) alternate test forms (X1, X2 and X3) were constructed and administered to fifty-eight (58) business students at University Practice Senior High School in the Cape Coast Metropolis, Ghana. One psychological scale (DASS21) was adopted as Form Y. The tests were administered under suitable and conducive examination conditions to support the validity and reliability of the scores. **Findings**: The means of the four forms were not identical (µX1 ≠ µX2 ≠ µX3 ≠ µY): X1 (µ=7.23, n=58), X2 (µ=7.14, n=58), X3 (µ=8.01, n=58) and Form Y-DASS21 (µ=7.92, n=58), although the differences were not statistically significant (F=0.306, p=0.736 > 0.05). The variances were likewise not identical (σ^{2}X1 ≠ σ^{2}X2 ≠ σ^{2}X3 ≠ σ^{2}Y): X1 (σ^{2}=6.120, n=58), X2 (σ^{2}=9.007, n=58), X3 (σ^{2}=8.040, n=58) and Form Y-DASS21 (σ^{2}=8.034, n=58), although the test of homogeneity of variance was not significant (p=0.121 > 0.05). Finally, the covariances of the test forms were not equal (Cσ^{2}X1Y ≠ Cσ^{2}X2Y ≠ Cσ^{2}X3Y): X1 (Cσ^{2}=5.338, n=58, r=0.846), X2 (Cσ^{2}=6.023, n=58, r=0.831) and X3 (Cσ^{2}=7.898, n=58, r=0.783). **Conclusions**: The study concluded that the constructed alternate test forms met the conditions of congeneric parallelism. The forms were similar in content, and their µ, σ^{2} and Cσ^{2} were close, but not equal, across the test forms (X1, X2, X3 and Form Y).

In classroom assessment, testing is a multi-faceted and intricate field in which sound decision-making is difficult ^{ 1, 2}. For any evaluation to be reliable and valid, a number of factors must be taken into account ^{ 3, 4, 5}. Classroom assessment and evaluation usually lead to decisions about individuals and situations, and those decisions carry consequences. Some of these consequences are social or psychological, affecting an individual's motivation, goals, and even social status ^{ 5, 6}. Because of the importance given to test scores in our society, any error arising from a test can have serious consequences for educational decision-making. Classical true score theory is a simple model that describes how errors of measurement can influence observed scores ^{ 7, 8, 9}.

In the work of ^{ 10}, parallel forms are described as tests that are different subsets of the same universe of items and that capture the same attribute with the same accuracy. As a measurement model for the scores of parallel forms, parallel measures are assumed, so that the correlation between the scores (the parallel-forms or parallel-test reliability) equals the reliability of both forms. Parallel test construction has long been, and remains, of interest to test developers ^{ 11, 12}.

Within the assumptions of CTT, test parameters, which include reliability, means, variances and covariances, are central issues in testing and can only be estimated from parallel tests ^{ 13}. Whenever test developers speak of alternate tests, what comes to mind is parallel tests. If parallel tests that are statistically equivalent to the original test can be developed, validities established for the original measures should also apply to the parallel measures for purposes of establishing the job-relevance and legal defensibility of the tests ^{ 14, 15}.

According to ^{ 16, 17}, a parallel test is not assured at the beginning of test construction; at that stage, the test developer can only regard the forms as alternate forms. Similarly, ^{ 18} stated that careful consideration of item-by-item parallelism during development results in alternate forms that are parallel at the item level. Creating alternate forms that are parallel is what ^{ 19} termed parallelism. Also, ^{ 10} pointed out that alternate test forms should consist of the same general and group factors in order to be considered measures of the same construct(s), and that these tests should have equal true score means, standard deviations, and item intercorrelations.

Furthermore, ^{ 11} asserted that two or more alternate forms are said to be parallel depending on the closeness of their means, variances, covariances, content and true score consistency, and of the covariance between each form and any other test. Parallelism is therefore a matter of degree: some alternate forms may be more parallel than others ^{ 12}.

The parallel model is the most restrictive measurement model for use in defining the composite true score. In addition to requiring that all test items measure a single latent variable (unidimensionality), the parallel model assumes that all test items are exactly equivalent to one another. All items must measure the same latent variable, on the same scale, with the same degree of precision, and with the same amount of error ^{ 13, 14}. All item true scores are assumed to be equal, and all error scores are likewise equal across items. For parallel forms of this kind, there is content similarity and true score consistency, and the means, variances and covariances of the forms, as well as the covariance between each form and any other test, are all equal. This is what ^{ 15} classified as a true classical parallel test.

This form of parallelism deviates slightly from the classical parallel form in that the means are not equal and the true scores are not consistent. There is some margin of error in the true score, either negative or positive, but all other features are the same as in the classical parallel form. That is, there is content similarity, and the variances, the covariances of the forms and the covariance between one test form and any other test are all equal.

The tau-equivalent model is identical to the more restrictive parallel model, except that individual item error variances are free to differ from one another. This implies that individual items measure the same latent variable on the same scale with the same degree of precision, but with possibly different amounts of error ^{ 11, 12}. All variance unique to a specific item is therefore assumed to be the result of error. The tau-equivalent model implies that, although all item true scores are equal, each item has a unique error term. This form also deviates from the classical parallel form to some degree: with the exception of the variances, which are not equal, all other features are the same as in the parallel form.

The essentially tau-equivalent model is, as its name implies, essentially the same as the tau-equivalent model. Essential tau-equivalence assumes that each item measures the same latent variable, on the same scale, but with possibly different degrees of precision ^{ 12, 20}. As with the tau-equivalent model, the essentially tau-equivalent model allows for possibly different error variances. The difference between item precision and scale is an important distinction to make. Whereas tau-equivalence assumes that item true scores are equal across items, the essentially tau-equivalent model allows each item true score to differ by an additive constant unique to each pair of variables ^{ 10, 13, 15}. This form of parallelism deviates to a large degree from the classical parallel form, which is considered the true parallel form. For this form, the variances and means are not equal, and the true scores are not consistent. However, there is content similarity, the covariances of the forms are equal, and the covariance between one test form and any other test is equal.

The congeneric model is the least restrictive, most general model of use for reliability estimation. The congeneric model assumes that each individual item measures the same latent variable, with possibly different scales, with possibly different degrees of precision, and with possibly different amounts of error ^{ 9, 11, 17}. Whereas the essentially tau-equivalent model allows item true scores to differ by only an additive constant, the congeneric model assumes a linear relationship between item true scores, allowing for both an additive and a multiplicative constant between each pair of item true scores ^{ 9, 10}.
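The four measurement models described above can be summarised compactly. The notation below is an assumption introduced here for illustration and does not appear in the source: $X_i$ is the observed score on item (or form) $i$, $T$ is the common true score, $E_i$ is the error score, and $a_i$ and $b_i$ are the additive and multiplicative constants mentioned in the text.

```latex
\begin{align*}
\text{Parallel:} \quad & X_i = T + E_i, \quad \sigma^2_{E_i} = \sigma^2_{E_j} \text{ for all } i, j \\
\text{Tau-equivalent:} \quad & X_i = T + E_i, \quad \sigma^2_{E_i} \text{ free to differ} \\
\text{Essentially tau-equivalent:} \quad & X_i = a_i + T + E_i \\
\text{Congeneric:} \quad & X_i = a_i + b_i T + E_i
\end{align*}
```

Read from top to bottom, each model relaxes one restriction of the model before it, which is why the congeneric model is the most general of the four.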

Examinees who take credentialing tests and other types of high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on initial attempts. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign an alternate form to repeat examinees. This practice appears to be missing in most Ghanaian classrooms, where teachers are conversant with constructing only one test form.

One of the key aims of classical test theory is to show students, researchers, test developers and others how they can effectively use classroom assessment principles, so that test users can apply statistical techniques to improve classroom assessment practices. Against this backdrop, we undertook this study to work practically with different test forms. The core purpose of the study was to estimate the means (µ), variances (σ^{2}) and covariances (Cσ^{2}) of alternate forms, in order to determine the degree of parallelism that the test forms could meet.

To obtain data for the study, we constructed three (3) alternate test forms (X1, X2 and X3) and adapted one psychological test (Form Y). The three alternate forms were core mathematics tests, and the psychological scale was the DASS 21. We administered all four forms (X1, X2, X3 and Form Y) to fifty-eight (58) business students at one of the senior high schools in the Cape Coast Metropolis (University Practice Senior High School). The tests were administered under suitable and conducive examination conditions, which were put in place to improve the validity and reliability of the scores.

To ensure content similarity across the test forms, the Head of the Mathematics Department of the school was contacted to establish the topics that students had covered. Some of the teachers of the form classes were also contacted to confirm the topics covered and a favourable time to conduct the test. From these topics, a table of specification was prepared so that the items would cover all the topics at appropriate levels of thinking; it served as a guide in constructing the three alternate forms. In addition, a test specification made up of item specifications was prepared to keep the items consistent and similar across the three forms (X1, X2 and X3). These processes ensured some level of content similarity among the alternate forms.

The items were constructed using test specifications that covered seven (7) general topics in core mathematics (these include: sets and operation of sets, real number system, mapping, relations and functions, linear equations and inequalities, algebraic expressions, number bases and plane geometry). Most of the items (n=11) measured knowledge, followed by those that measured comprehension (n=6); those that measured application were the fewest (n=3). The items did not extend to the analysis, synthesis and evaluation levels of the taxonomy. This was based on Scully (2017), who asserted that most multiple-choice items are suited to measuring the lower-order thinking of learners.

We further described the content, objective and description of each test item in the item specifications. We defined the content of each item, that is, where the item was found in the syllabus. To inform our test takers, we also spelt out the objective of each item and explained the rationale for constructing it. Finally, we provided a description of each item by setting out our expectations. Examples of how the items were constructed are given in Table 1, Table 2 and Table 3. Before responding to the questions, students were given instructions. The instructions stated that there were 20 objective questions, each with four options (A, B, C and D), and that students were required to answer all the questions by circling the correct response.

The analysis focused on the parameters of the alternate forms, that is, their means, variances and covariances. In estimating these parameters of parallelism, standard deviations, F-values, correlations (relationships among test forms) and p-values were also reported, accompanied by interpretation and discussion. The administered and scored tests were analysed using descriptive and inferential statistics in the Statistical Package for Social Sciences (SPSS) v.25.

This section reports the findings that emerged from the test administration and scoring. The findings are based on the estimated parameters.

**Estimating Mean Parameter of the Test Forms (X1, X2, X3 and Form Y)**

The mean is the arithmetic average of a collection of numbers. In statistics, it is a measure of central tendency of a probability distribution, alongside the median and mode, and it is also referred to as the expected value. Our first task was to find out whether the mean scores of the test forms were equal. To do so, we performed a descriptive analysis of the mean scores on each form. Where an item on one form (for example, item 1) was found to be faulty and had to be treated as a bonus, scores on the corresponding item were excluded from all forms before analysis, because a bonus question inflates the scores on that form relative to the others. The means are presented in Table 4.

Table 4 shows the mean and standard deviation of X1 (µ=7.23, SD=2.643), X2 (µ=7.14, SD=3.023), X3 (µ=8.01, SD=3.195) and Form Y-DASS21 (µ=7.92, SD=3.232). The means are very close, and the standard deviations are approximately 3 for all the test forms. This shows the similarity in the students' responses across the four (4) forms. The closeness of the means is confirmed by the F-value of 0.306 and the significance value of *p*=0.736 (> 0.05), which implies that the differences among the mean scores are not statistically significant. In essence, Table 4 shows that the means of the three test forms and Form Y (DASS21) were not exactly equal (µX1 ≠ µX2 ≠ µX3 ≠ µY), although the differences were not statistically significant.
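The comparison of means can be illustrated with a short calculation. The following is a minimal sketch, assuming a one-way ANOVA across the four forms; the source reports only the F-value from SPSS and does not show the computation, and the scores below are hypothetical, not the study data. The function name `one_way_anova_f` is introduced here for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical scores for illustration only (not the study data)
scores = {
    "X1": [5, 7, 9, 6, 8],
    "X2": [6, 7, 8, 5, 9],
    "X3": [7, 8, 10, 6, 9],
    "Y": [6, 8, 9, 7, 10],
}

for form, values in scores.items():
    print(form, "mean =", mean(values))

f_value, df_b, df_w = one_way_anova_f(list(scores.values()))
print("F =", round(f_value, 3), "df =", (df_b, df_w))
```

The p-value would then be read from the F distribution with these degrees of freedom, for example through a statistics package. A small F relative to its critical value indicates that the differences among the form means are not statistically significant.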

**Estimating the Variance Parameter of the Test Forms (X1, X2, X3 and Form Y)**

The term *variance* refers to a statistical measurement of the spread between numbers in a data set. More specifically, *variance* measures how far each number in the set is from the mean and thus from every other number in the set. *Variance* is often denoted by the symbol σ^{2}. In this paper, estimating the variance parameter is one determinant of parallelism. Our task was to test the equality of the variances among all the test forms. The result is presented in Table 5.

The variances of the scores on the alternate forms are presented in Table 5. The results are X1 (σ^{2}=6.120, n=58), X2 (σ^{2}=9.007, n=58), X3 (σ^{2}=8.040, n=58) and Form Y-DASS21 (σ^{2}=8.034, n=58). The results suggest that, although the variances of the forms are not the same, the differences are not large. Levene's test of homogeneity of variance indicates that the variances can be assumed statistically equal across the forms (*p*=0.121 > 0.05). In essence, the variance estimates of the test forms were not exactly equal (σ^{2}X1 ≠ σ^{2}X2 ≠ σ^{2}X3 ≠ σ^{2}Y), although the differences were not statistically significant.
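The variance comparison can be sketched in the same way. The following is a minimal illustration, assuming the mean-centred version of Levene's statistic; the source does not state which variant SPSS was asked to report, and the scores are hypothetical, not the study data. The function name `levene_w` is introduced here for illustration.

```python
from statistics import mean, variance


def levene_w(groups):
    """Levene's W using absolute deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean
    z = []
    for g in groups:
        g_mean = mean(g)
        z.append([abs(x - g_mean) for x in g])
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(
        len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means)
    )
    within = sum(
        (v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi
    )
    return ((n_total - k) / (k - 1)) * between / within


# Hypothetical scores for illustration only (not the study data)
scores = {
    "X1": [5, 7, 9, 6, 8],
    "X2": [4, 7, 10, 5, 9],
    "X3": [7, 8, 11, 5, 9],
    "Y": [6, 8, 9, 7, 10],
}

for form, values in scores.items():
    # statistics.variance is the sample variance (divides by n - 1)
    print(form, "variance =", round(variance(values), 3))

print("Levene W =", round(levene_w(list(scores.values())), 3))
```

As with the ANOVA, W is compared with the F distribution on k - 1 and N - k degrees of freedom; a non-significant result means the assumption of equal variances is not rejected.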

**Estimating the Covariance Parameter of the Test Forms (X1, X2, X3 and Form Y)**

*Covariance* measures the total variation of two random variables from their expected values. Using *covariance*, one can only gauge the direction of the relationship (whether the variables tend to move in tandem or show an inverse relationship). In this paper, one of the core objectives was to estimate the covariance parameter of the test forms. That is, we wanted to find out if the covariances are equal among the test forms. To achieve this, the scripts were coded so that scores on all four (4) forms could be entered for a particular student. The scores were entered without treating the test as a factor. This made it possible to estimate the correlation between each pair. The obtained results are presented in Table 6.

Table 6 shows that the covariance of X1 and X2 was Cσ^{2}=5.338 (n=58), with a correlation of 0.846. For X1 and X3, the covariance was Cσ^{2}=6.023 (n=58), with a correlation of 0.831, and for X1 and Form Y it was Cσ^{2}=7.898 (n=58), with a correlation of 0.783, all significant at *p* < 0.01. The results suggest that the covariances are not the same for all the test forms. In other words, there are differences in the covariances of the test forms (Cσ^{2}X1Y ≠ Cσ^{2}X2Y ≠ Cσ^{2}X3Y). The overall results suggest that we could not obtain equal parameters for the test forms. Therefore, the parallelism of our test forms met only the congeneric model condition, in which the test forms have similar content and are only alternatives to each other.
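Covariances and correlations between forms require the scores to be paired by student, as described above. The following is a minimal sketch of the two calculations using the sample (n - 1) formulas; the scores are hypothetical, not the study data, and the function names are introduced here for illustration.

```python
from math import sqrt
from statistics import mean


def sample_covariance(x, y):
    """Sample covariance of two paired score lists (divides by n - 1)."""
    if len(x) != len(y):
        raise ValueError("score lists must be paired by student")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical paired scores: position i is the same student on every form
scores = {
    "X1": [5, 7, 9, 6, 8],
    "X2": [4, 7, 10, 5, 9],
    "X3": [7, 8, 11, 5, 9],
    "Y": [6, 8, 9, 7, 10],
}

for other in ("X2", "X3", "Y"):
    cov = sample_covariance(scores["X1"], scores[other])
    r = pearson_r(scores["X1"], scores[other])
    print("X1 vs", other, "cov =", round(cov, 3), "r =", round(r, 3))
```

The covariance depends on the scale of the scores, whereas the correlation is bounded between -1 and 1, which is why both are usually reported together when judging how closely two forms track each other.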

The findings give ample evidence that the multidimensionality of the items, as well as of the test form itself, makes it difficult to obtain equal parameters, that is, the same means (µ), variances (σ^{2}) and covariances (Cσ^{2}), across alternate forms. The results suggest that in practice it is very difficult to generate a pool of items that adequately represents the content of the test and has the same difficulty level. The results of this study suggest that the parallelism achieved followed the congeneric model. This model assumes that each individual item measures the same latent variable, with possibly different scales, with possibly different degrees of precision, and with possibly different amounts of error ^{ 3, 8, 16, 18}.

The findings give reason to agree with scholars such as ^{ 4, 5}, who asserted that it may be impossible to achieve such a goal under the classical test theory (CTT)-based definition of parallel tests, in which true scores and variances of observed test scores across forms must be identical for any possible subpopulation of examinees. When CTT conditions for parallelism are not strictly met, post-administration equating and passing score determination are adopted to adjust for differences among test forms, so that it makes no difference which form an examinee takes.

Relatedly, it is asserted by previous authors that using an item bank to construct parallel forms of multidimensional measures also creates problems because often the constructs making up the content of these measures are not well understood, and thus it is difficult to replicate the original test, or even to attempt to create separate item banks for different content areas because content domains are difficult to identify. If a single item bank is used, the number of items needed to create a pool representing the content of the original measure would be so large that resulting alternate forms would tend to have an unstable content structure ^{ 2, 13}.

The study lends strong support to claims that alternate test forms are designed to avoid or reduce content- or item-specific practice effects associated with repeated administrations of the same neuropsychological test(s) ^{ 16, 17}. Relatedly, examination manuals for many intellectual and neuropsychological tests illustrate that practice effects are common, especially over brief retest intervals (e.g., days or weeks), and this could lead to unequal parameters of the test items. Our results support the work of ^{ 19}, who indicated that in any test construction, alternate test forms should include the same number of items, and the items should be equivalent and have the same content, though it may be difficult to obtain the same parameters (equal means, variances and covariances).

From the results of the study, we infer that parallel test development procedures are complicated; however, guided by the characteristics of the original test, test developers, classroom teachers and others could construct similar test items to measure similar constructs or traits of the test takers ^{ 2, 11}. We must reiterate, however, that if the test or its items are multidimensional, it is difficult to construct an alternate form that is parallel in content using more traditional development procedures (sampling similar test items from a general content item pool). In response to these issues, scholars have proposed that test developers must be guided by subject content to produce alternate test forms that are parallel in terms of means, standard deviations, and factor structures ^{ 20}.

For classroom and standardized testing purposes, this paper shows students, classroom teachers and researchers how the parameters of alternate test forms, that is, means (µ), variances (σ^{2}) and covariances (Cσ^{2}), can be estimated. Of the conditions necessary for parallelism, the test forms in this paper met only the congeneric model condition, in which the alternate forms had similar content but different parameter estimates (X1≠X2≠X3≠Y). From the study, it is instructive for classroom teachers and assessment practitioners to note that constructing test items with similar content is very beneficial. It helps those measuring a construct of interest and those attempting to preserve the validities (and the legal defensibility) of the original form of the test. Therefore, classroom teachers and standardized testing companies must take a keen interest in constructing alternate test forms.

**CV**: Covariances; **CTT**: Classical Test Theory; **CI**: Confidence Interval; **DASS 21**: Depression, Anxiety and Stress Scale - 21 Items; **SPSS**: Statistical Package for Social Sciences

The authors would like to thank their anonymous reviewers for their academic stimulation and constructive criticism throughout the development of the paper. We are again grateful to our colleague senior lecturers in educational measurement and evaluation for their input in the paper.

No conflict of interest exists. We wish to confirm that there are no known conflicts of interest associated with this publication, and there has been no significant financial support for this work that could have influenced its outcome.

Not applicable.

Not applicable.

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

**SN**^{1} conceived the study, drafted the methodology, performed all the analyses and drew the conclusions. **SA**^{2} drafted the introduction of the study. **TB**^{3} discussed the paper. All the authors (**SN**^{1}, **SA**^{2}, **TB**^{3}) reviewed multiple drafts and proposed additions and modifications. SN had the final responsibility to submit the paper. All the authors read and approved the final manuscript.

No funding was received for this study.

[1] Clause, C. A., Mullins, M. E., Nee, M. T., Pulakos, E., & Schmitt, N. (2016). Parallel test form development: A procedure for alternate predictors and an example. Personnel Psychology, 6(51), 1-287.

[2] Cronbach, L. J. (1947). Test "reliability": Its meaning and determination. Psychometrika, 12(1), 1-16.

[3] Drasgow, F. (2016). Technology and testing: Improving educational and psychological measurement. New York: Routledge.

[4] Gierl, M., Daniels, L., & Zhang, X. (2017). Creating parallel forms to support on-demand testing for undergraduate students in psychology. Journal of Measurement and Evaluation in Education and Psychology, 8(3), 288-302.

[5] Hilger, N., & Beauducel, A. (2017). Parallel-forms reliability. In Encyclopedia of Personality and Individual Differences (pp. 1-3). Springer, Cham.

[6] Kowalski, I. M., Protasiewicz-Fałdowska, H., Dwornik, M., Pierożyński, B., Raistenskis, J., & Kiebzak, W. (2014). Objective parallel-forms reliability assessment of 3-dimension real time body posture screening tests. BMC Pediatrics, 14(1), 1-8.

[7] Lord, F. M., & Novick, R. M. (2000). Statistical theories of mental test scores. Educational testing services: New York University.

[8] Lovibond, S. H., & Lovibond, P. F. (2014). Manual for the depression anxiety & stress scales (2nd ed.). Sydney: Psychology Foundation.

[9] Luecht, R. M. (2016). Computer-based test delivery models, data, and operational implementation issues. In F. Drasgow (Ed.), Technology and testing: Improving educational and psychological measurement (pp. 179-205). New York: Routledge.

[10] Miller, J., & Ulrich, R. (2003). Simple reaction time and statistical facilitation: A parallel grains model. Cognitive Psychology, 46(2), 101-151.

[11] Raykov, T. (2015). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21(2), 173-184.

[12] Raykov, T., Patelis, T., & Marcoulides, G. A. (2011). Examining parallelism of sets of psychometric measures using latent variable modeling. Educational and Psychological Measurement, 71(6), 1047-1064.

[13] Scully, D. (2017). Constructing multiple-choice items to measure higher-order thinking. Practical Assessment, Research & Evaluation, 22(4), 4-13.

[14] Sharma, P., Dunn, R. L., Wei, J. T., Montie, J. E., & Gilbert, S. M. (2016). Evaluation of point-of-care PRO assessment in clinic settings: integration, parallel-forms reliability, and patient acceptability of electronic QOL measures during clinic visits. Quality of Life Research, 25(3), 575-583.

[15] Singhal, S. P., & Sridevi, M. (2019). Comparative study of performance of parallel Alpha Beta Pruning for different architectures. In 2019 IEEE 9th International Conference on Advanced Computing (IACC) (pp. 115-119). IEEE.

[16] Sireci, S., & Zenisky, A. (2016). Computerized innovative item formats: Achievement and credentialing. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 313-334). New York: Routledge.

[17] Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60(2), 174-195.

[18] Wolfinger, R. D. (2014). Heterogeneous variance: covariance structures for repeated measures. Journal of Agricultural, Biological, and Environmental Statistics, 8(7), 205-230.

[19] Wu, S. L., Tio, Y. P., & Ortega, L. (2021). Elicited imitation as a measure of L2 proficiency: New insights from a comparison of two L2 English parallel forms. Studies in Second Language Acquisition, 8(7), 1-30.

[20] Yarnold, P. R. (2014). How to Assess the Inter-Method (Parallel-Forms) Reliability of Ratings Made on Ordinal Scales: Emergency Severity Index (Version 3) and Canadian Triage Acuity Scale. Optimal Data Analysis, 3(4), 50-54.

Published with license by Science and Education Publishing, Copyright © 2022 Simon Ntumi, Sheilla Agbenyo and Tapela Bulala

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Simon Ntumi, Sheilla Agbenyo, Tapela Bulala. Parallelism of Test Items: Estimating the Means (µ), Variances (σ^{2}) and Covariances (Cσ^{2}) of Alternate Test Forms. *International Journal of Data Envelopment Analysis and Operations Research*. Vol. 3, No. 1, 2022, pp 1-7. http://pubs.sciepub.com/ijdeaor/3/1/1

