Doctoral Dissertations from the Assessment & Measurement Program
by Robin D. Anderson, PsyD (2001)
Advisor: Dr. Donna Sundre
The purpose of this study was to examine whether one of the most common standardized testing procedures, the collection of demographic information prior to testing, facilitates performance decrements in subjects for whom a negative domain performance stereotype exists. The primary investigation involved examining whether the presence of a gender identification section on an optically readable form and the request that the gender section of the form be completed was a priming stimulus sufficient to trigger a stereotype threat effect. This study provided a real-world adaptation of previous stereotype threat research. Results indicate that the inclusion of a gender identification item is not a sufficient priming stimulus to trigger stereotype threat patterns in low-stakes assessments. Results do indicate, however, that the removal of such an item may increase motivation and performance for both negatively and positively stereotyped groups.
by Susan K. Barnes, Ph.D. (2010)
Advisor: Dr. J. Christine Harmes
In this era of increased accountability in education, there is an urgent need for tools to use in assessing the abilities and instructional levels of young children. Computers have been used successfully to assess the abilities and achievements of older children and adults. However, there is a dearth of empirical research to provide evidence that computer-based testing (CBT) is appropriate for use with typically developing children under the age of six.
The purpose of this study was to explore the feasibility of using CBT with children in preschool and kindergarten. Children were administered paper-and-pencil (PPT) and CBT versions of the rhyme awareness subscale of the Phonological Awareness Literacy Screening (Preschool). After completing each assessment, each child shared individual reactions by selecting a card illustrating an emotion (e.g., joyful, happy, bored, sad, angry) and participating in a brief interview. Parents and teachers completed short questionnaires describing each child’s previous computer experience, fine motor skills, and ability to recognize and generate rhymes.
An embedded mixed methods design was used to explore (a) to what extent children could complete the CBT independently, (b) how children reacted to the tests, and (c) how the results from the CBT and the PPT compared. Interview transcripts and field notes were used to more fully explain the test results. Findings indicated that preschool and kindergarten children needed help with the CBT. Difficulties were related to using the mouse and following directions. About 12% of the kindergarteners needed adult support to finish the CBT, compared to nearly half of the preschoolers. Children of all ages reported enjoying using the computer and doing the rhyming tasks; however, many preschoolers appeared anxious to leave the testing area or tried to discuss topics unrelated to the assessment. For preschoolers, there was a test administration mode effect; the CBT was more difficult than the PPT. These results have implications for test development and use. CBTs for preschoolers must be designed to meet their physical and cognitive developmental needs. Also, preschool children need adequate practice using computer hardware and software before they can reliably demonstrate their skills and abilities through CBT.
Examining Change in Motivation Across the Course of a Low-Stakes Testing Session: An Application of Latent Growth Modeling
by Carol L. Barry, Ph.D. (2010)
Advisor: Dr. Sara J. Finney
As the emphasis on accountability in education increases, so does the prevalence of low-stakes testing. It is essential to understand test-taking motivation in low-stakes contexts, as low motivation has implications for the validity of inferences made from test scores about examinee knowledge and ability. The current study expanded upon previous work by exploring the existence of types of test-takers characterized by qualitatively different patterns of test-taking effort across the course of a three-hour low-stakes testing session. Mixture modeling results did not support the existence of types of test-takers for this sample of upperclass examinees. Latent growth modeling results indicated that change in effort across the testing session was well-represented by a piecewise growth form, wherein effort increased from the first to fourth test and then decreased from the fourth to fifth test. Further, there was significant variability in effort for each test as well as in rates of change in effort. The inclusion of external predictor variables indicated that whether an examinee attended the regular testing session versus a makeup session, mastery approach goal orientation, conscientiousness, and agreeableness partly accounted for variability in effort for the various tests, whereas only agreeableness was related to rates of change in effort. Additionally, the degree to which examinees viewed a particular test as important was weakly to moderately related to effort for a difficult, cognitive test but not for less difficult, noncognitive tests. Further, change in test-taking effort was not related to change in perceived test importance. These results have important implications both for assessment practice and the use of motivation theories to understand test-taking motivation.
by Anna Katherine Busby, Ph.D. (2005)
Advisor: Dr. Christine DeMars
This study provides validity evidence for the use of the Leadership Attitudes and Beliefs Scale III (LABS III; Wielkiewicz, 2000) scores. The scale is based upon the ecology theory of leadership (Allen, Stelzner, & Wielkiewicz, 1998), and is designed to measure the attitudes and beliefs college students have toward leadership. This study was conducted with 845 college students at a large, mid-western, urban institution. The content of the LABS III items was examined to determine the relationship between the ecology theory of leadership and the scale. The items did not completely represent the ecology theory. A confirmatory factor analysis (CFA) was conducted to test the hypothesized two-factor model, and the data did not fit the hypothesized model well. The scale was modified using theoretically supported model modifications, and additional research questions were explored. The modified LABS III scores were correlated with scores from the Miville-Guzman Universality-Diversity Scale-Short Form (Fuertes, Miville, Mohr, Sedlacek, & Gretchen, 2000). A moderate correlation was found, and this result supported the hypothesis that there is a relationship between attitudes toward diversity and attitudes toward leadership. The modified LABS III scores were also correlated with the subscale scores of the Student Leadership Practices Inventory (Posner & Brodsky, 1992). Moderate correlations were found, and this result supports the hypothesis that leadership attitudes are related to leadership practices. It was hypothesized that age would be strongly correlated with leadership attitudes; however, the results did not support this hypothesis. The results also supported the hypothesis that men and women differ in their attitudes toward leadership. Further examination of the ecology theory of leadership in relation to the LABS III and the LABS III factor structure is recommended. The results from this study suggest that a number of theory-based hypotheses were supported.
However, the theory needs continued refinement, and its relationship to the scale needs to be explicated. Only through continued reflection and careful study can the nomological net of the ecology theory of leadership be developed and contribute to research in leadership.
Invariance of the Modified Achievement Goal Questionnaire Across College Students with and without Disabilities
by Hilary Lynne Campbell, Ph.D. (2007)
Advisor: Dr. Dena Pastor
As an increasing number of students with disabilities (SWDs) take part in postsecondary education, postsecondary institutions must meet the needs of this unique population. Because it is linked to important achievement-related outcomes, one area in which educators have historically tried to meet students' needs is achievement goal orientation (AGO). Educators must ensure that they are able to measure AGO for SWDs and to determine whether SWDs would benefit from different services or educational methods than their nondisabled peers. In the K-12 literature, studies suggest that SWDs may have different AGO profiles than their peers, but no such research has been conducted for college students.
One specific instrument designed to measure AGO, the modified Achievement Goal Questionnaire (AGQ-M; Finney, Pieper, & Barron, 2004), was administered to college students with and without disabilities. Confirmatory factor analyses were conducted with both populations to test the four-factor structure of AGO (Mastery-Approach, Mastery-Avoidance, Performance-Approach, Performance-Avoidance). Next, a series of tests was conducted to test the measurement and structural invariance of the AGQ-M across students with and without disabilities. Finally, latent means for the two samples on each dimension of AGO were compared.
The four-factor model of AGO fit both samples well. Further, invariance of factor loadings (metric invariance), intercepts (scalar invariance), error variances, factor variances, and factor covariances was supported. Since the AGQ-M was found to be invariant, latent means were compared. In contrast to previous findings in the literature, results indicated no significant or practically meaningful differences between these two groups on any of the four dimensions of the AGQ-M. These results suggest that college students with and without disabilities may not have markedly different AGO profiles.
Results may differ from previous findings because the sample of SWDs in this study had already completed several semesters of college at a moderately selective institution; these students likely differed in important ways from the general population of SWDs. This study lays the groundwork for a host of future studies, including replication studies, studies involving specific disability groups, and studies linking AGO profiles to external achievement-related variables for college students with and without disabilities.
Using explanatory item response models to examine the impact of linguistic features of a reading comprehension test on English language learners
by Jaime A. Cid, Ph.D. (2009)
Advisors: Dr. Dena Pastor and Dr. Joshua Goodman
The unintended consequences of high-stakes testing decisions made on scores that may vary as a function of language proficiency have been noted as a major threat to English language learners (ELLs) (Herman & Abedi, 2004; Mahoney, 2008). While several studies have focused on the effects of language proficiency in high-stakes science and math examinations, the impact of English language proficiency on reading comprehension tests has received far less attention. Furthermore, the effects that specific linguistic features of reading comprehension tasks have on ELLs' test performance have been noticeably understudied. The overall aim of this study was to examine the impact of seven linguistic features (false cognates, homographs, negative wording, propositional density, surface structure, syntactic complexity, and vocabulary) of a high-stakes reading comprehension test on Spanish-speaking ELLs using explanatory item response models conceptualized as Hierarchical Generalized Linear Models (HGLMs). More specifically, in a 40-item reading test, explanatory item response models were used to investigate: (a) differential item functioning (DIF) for ELLs and non-ELLs in a traditional manner; (b) whether items consisting of certain linguistic features were differentially difficult; (c) the extent to which linguistic features may be differentially difficult for ELLs in comparison to non-ELLs; and (d) whether the difficulty of the items with such linguistic features varied across ELLs with different years of formal exposure to Spanish as the primary language of academic instruction. The results of investigating DIF in a traditional manner revealed that six items (four favoring non-ELLs and two favoring ELLs) displayed DIF with group differences of at least half a logit. The estimates of the effects of the seven linguistic features were statistically significant (p < 0.0001).
However, only false cognates, negative wording, surface structure, and vocabulary increased the difficulty of an item. The differential functioning of the seven linguistic features revealed that the log-odds of getting a typical item right were 0.4867 logits lower for ELLs compared to non-ELLs. However, from a practical significance perspective, the linguistic features were not differentially difficult for the two groups. While the results of the linguistic feature combinations showed that the majority of the features displayed differential difficulty in favor of non-ELLs, none of them can be considered of practical significance. Finally, items with only false cognates were less difficult for ELLs with more years of exposure to Spanish as the primary language of academic instruction. The benefits of the explanatory properties of English language status as a person-level predictor in a reading comprehension test, along with practical implications of the current research and directions for future research, are discussed.
Methods for Identifying Differential Item and Test Functioning: An Investigation of Type I Error Rates and Power
by Amanda M. Dainis, Ph.D. (2008)
Advisors: Dr. J. Christine Harmes and Dr. Christine DeMars
This study examined bias, and therefore fairness, by investigating methods used for identifying differential item functioning (DIF). Four DIF-detection methods were applied to simulated data and empirical data. These techniques were selected to focus on a relatively new method, DFIT, and compare it to another IRT-based method (likelihood ratio test), and two Classical Test Theory-based methods (logistic regression and Mantel-Haenszel). Within the simulation study, four factors were manipulated: sample size, the presence and absence of impact, the uniformity and non-uniformity of the DIF, and the magnitude of the DIF. The Type I error and power rates of the methods were examined, and results indicated that the performance of the methods depended on the data conditions. The DFIT method had low Type I error rates across all simulated conditions. Regardless of the absence or presence of impact, the likelihood ratio test and the logistic regression main effect test had elevated Type I error rates under both sample size conditions. While the Mantel-Haenszel method's error rates were satisfactory across all conditions, its power was low when detecting non-uniform DIF. High power was demonstrated by the DFIT and likelihood ratio methods, but the logistic regression method yielded unsatisfactory power rates under the impact present condition. The DFIT method, as the central focus of this investigation, warrants further attention. A particular concern is the method's performance when applied to smaller sample sizes, due to fitting a 3PL model to a dataset with insufficient sample size. Another area for further investigation is the Item Parameter Replication (IPR) procedure, which is used to establish statistical significance within the DFIT framework. 
Although it has proven to be a reasonably efficient technique for establishing statistical significance, its conservative performance in the empirical portion of this study suggests the need for further examination under conditions with smaller amounts of DIF. DIF detection plays an integral part in constructing a fair and unbiased test. Based on empirical evidence, such as that reported here, researchers and practitioners should examine how an item or test is functioning statistically before spending resources to examine a conceptual, underlying cause of DIF.
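As a rough illustration of one of the four methods compared in the abstract above, the Mantel-Haenszel procedure can be sketched in a few lines of Python. This is a minimal sketch, not the dissertation's analysis: the function name `mantel_haenszel_odds_ratio`, the tuple layout of each score stratum, and the example counts are all assumptions made for illustration. The statistic shown is the standard Mantel-Haenszel common odds ratio pooled over matched total-score strata.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one item.

    `strata` is a list of (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect) counts, one tuple per matched total-score level.
    This layout is a hypothetical convention chosen for the sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts for two score levels; a value near 1 suggests no uniform DIF.
example = [(20, 10, 10, 20), (30, 10, 15, 20)]
print(mantel_haenszel_odds_ratio(example))
```

A ratio above 1 means that, among examinees matched on total score, the reference group has higher odds of answering the item correctly. Because the counts are pooled across score levels, differences that change direction along the score scale can cancel out, which is consistent with the low power for non-uniform DIF reported above.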
by Susan Lynn Davis, Ph.D. (2005)
Advisor: Dr. Sara Finney
Assessing student development can be a challenge in that such constructs are difficult to define and difficult to measure. However, the need exists for universities to understand students' personal development as they progress through college. Although there are many important facets of student development worthy of examination, this study focused on one aspect of development commonly referenced in university mission statements: students' premonition for lifelong learning. Previous research has noted the difficulty in determining if universities are creating lifelong learners; however, this study attempted to examine this development by means of a related concept: student achievement goal orientation. One cohort of students was assessed on three occasions during college to estimate change in five dimensions of student achievement goal orientation: mastery-approach, performance-approach, mastery-avoidance, performance-avoidance, and work-avoidance. In addition to addressing the need for information on student development, this study attempted to address the shortcomings of prior longitudinal research, for example, by employing specific methodologies that allow inclusion of partial records, estimation of individual variation within change, examination of measurement invariance, and fluctuation within patterns of change. Before estimating change over time, it was first determined that the measurement of goal orientation was psychometrically stable across the three assessments, as indicated by the sufficient level of measurement invariance. Change was estimated using Latent Growth Modeling, which allowed the estimated pattern of change to be explicitly identified and described. Individual variation in change was also found and used to address ancillary research questions regarding change across dimensions of goal orientation and the relationship between initial goal orientation and change in goal orientation.
All five dimensions of goal orientation exhibited significant change across the three assessments. The identified patterns of change present interesting information for student development and student motivation. Discussion of this estimated change includes exploration of the change in terms of achievement goal orientation, students' motivational perspective, and the development of lifelong learners.
Construct Validity Evidence for University Mattering: Evaluating Factor Structure, Measurement Invariance, and Latent Mean Differences of Transfer and Native Students
by Megan France (2011)
Advisor: Dr. Robin D. Anderson
The psychological construct university mattering is defined as the feeling that one makes a difference and is significant to his or her university’s community. University mattering emerged from the theory of general mattering, which describes mattering as a complex construct consisting of the facets awareness, importance, ego-extension, and reliance. Researchers have attempted to operationalize university mattering through the development of various measures, specifically the Mattering Scale for Adults in Higher Education (MHE), the College Mattering Inventory (CMI), and the University Mattering Scale (UMS). The MHE and CMI were not developed based on an underlying theory of mattering and do not map to the facets listed above. The UMS was developed by writing items to represent these facets; however, after a psychometric evaluation of this scale, researchers provided numerous suggestions for improving the scale and the measurement of university mattering. Those suggestions were employed and the Revised University Mattering Scale (RUMS) was developed for use in the current study.
The purpose of this dissertation was twofold. First, the model-data fit of the RUMS was evaluated using confirmatory factor analysis (CFA). Five a priori models were tested using two independent samples: (a) a one-factor model, (b) a four-factor model, (c) a higher-order model, (d) a bifactor model, and (e) an incomplete bifactor model. In Sample 1, the incomplete bifactor model had the best overall fit. In Sample 2, the bifactor model had the best overall fit, which was surprising given that an admissible solution could not be found for this model in the first sample. However, across both samples, there were areas of localized misfit (i.e., large standardized covariance residuals). Furthermore, in using the incomplete bifactor and bifactor models to evaluate the items, numerous items were factorially complex (i.e., items cross-loaded on both the general mattering factor and their corresponding specific factor). Thus, ten items were removed. A modified model with 24 items was then fit to the data. Although fit improved, there were still areas of concern. Specifically, three items with large standardized residuals and six items that cross-loaded were deleted. The resulting 15-item measure fit a one-factor structure well and was named the Unified Measure of University Mattering-15 (UMUM-15).
The second purpose of this study was to assess the measurement invariance of the UMUM-15, the measure championed in Study 1. Of particular interest to this study was the comparison of transfer student scores on university mattering to scores of native students (i.e., students who began at the institution as first-years with no transfer credit). Transfer students constitute a large subgroup on many campuses. Transfer students often report struggling with academic and social integration at their new campus. Therefore, it is possible that transfer students have a lower sense of mattering than native students. Qualitative research indicates that transfer students frequently report feeling a lack of mattering after relocating to their new college campuses. Furthermore, by definition, transfer students may lack feelings of mattering because they are in transition. Schlossberg (1989) theorized, “…people in transition often feel marginal and that they do not matter” (p. 6). For this study, the evaluation of measurement invariance was conducted using a step-by-step procedure introducing more equality constraints on the model across groups at each step (Meredith, 1993; Steenkamp & Baumgartner, 1998). Tests of measurement invariance began with testing configural invariance, followed by metric invariance, and finally, scalar invariance. With the establishment of scalar invariance, latent mean differences between transfer and native students were interpreted. As expected, transfer students had lower latent means than native students on university mattering. Not only did this study provide strong construct validity evidence for the UMUM-15, but this study also made several notable contributions to the current research on university mattering.
by Keston H. Fulcher, Ph.D. (2004)
Advisor: Dr. T. Dary Erwin
Construct ambiguity and methodological shortcomings of instrument development have obscured the meaning of curiosity research. Nonetheless, it is an important construct, especially since it has been linked recently to lifelong learning. The purpose of these studies is to collect validity evidence for a new self-report questionnaire, The Curiosity Index (CI), which is based on Ainley's (1987) parsimonious breadth and depth conceptualization of curiosity. Proctors administered the CI to 1042 college freshmen, 854 college sophomores/juniors, and 74 members of a lifelong learning institute. In Study 1, freshmen CI data were analyzed using confirmatory factor analysis in an exploratory manner to identify items best representing the two-factor model. After selective item removal, all indices except for the RMSEA suggested good fit. In addition to the CI, college freshmen took several other instruments. In Study 2, scores derived from these instruments were correlated with the total, breadth, and depth scores. As predicted, the total CI, breadth, and depth scores correlated moderately to highly with trait curiosity and intrinsic motivation, weakly with confidence, not at all with intelligence or extrinsic motivation, and negatively with work-avoidance. In addition, mastery-approach correlated more highly with depth than with breadth, as predicted. In Study 3, average total, breadth, and depth scores of freshmen, sophomores, and lifelong learners were compared via ANOVAs. It was predicted that lifelong learners would have the highest scores on all categories, then sophomores, then freshmen. Lifelong Learning Institute members and sophomores did score significantly higher on total and depth curiosity than freshmen; however, no other differences were found. In Study 4, item response theory was used to investigate the amount of information obtained by the CI along the continuum of curiosity, from the least curious to the most curious students.
Generally, information was high; however, students scoring 1.5 SDs above the mean or higher were measured less reliably. Overall, the results support the use of the Curiosity Index for measuring breadth and depth curiosity. Future directions of validation include additional correlational studies with other curiosity measures, reversing the response scale, and creating more difficult breadth items.
Examining the Psychometric Properties of a Multimedia Innovative Item Format: Comparison of Innovative and Non-Innovative Versions of a Situational Judgment Test
by Sara Lambert Gutierrez, Ph.D. (2009)
Advisor: Dr. J. Christine Harmes
In the measurement field, innovative item formats have shown promise for increasing the capability to assess constructs not easily measured with traditional item formats. These items are often assumed to also provide opportunities for better measurement. However, little empirical research exists to support these assumptions. The purpose of this study was to explore the psychometric properties of a multimedia innovative item type and then compare the results to the properties of a non-innovative item format. Participants were administered one of two tests of identical content: one consisting of an innovative item format and the other consisting of a non-innovative item format. Exploratory factor analyses were conducted to evaluate the dimensionality of the two tests. The graded-response model was fit to both tests to produce item and test level characteristic curves, allowing for the examination of the reliability, or information, produced by each test and the individual items. Measurement efficiency, a ratio of the average amount of information provided relative to the average amount of time taken, was also reviewed. Face validity was examined by analyzing participant ratings on an eight-item post-test survey. Finally, criterion-related validity was investigated for the innovative item format by examining the relationship between test scores and supervisors’ ratings of employee performance. Findings from this research suggest that the use of innovative items may alter the underlying construct of an assessment, and could potentially provide more measurement information about examinees with low prioritization skills. Also, innovative item formats do not necessarily decrease measurement efficiency, as has been previously suggested. Participants’ perceptions of the tests indicated that they felt the innovative version provided a more realistic experience and increased levels of engagement. 
Criterion-related validity evidence for the innovative version was inconsistent across two samples. The key implication of these results applies to any practitioner employing innovative items: the addition of innovative item formats likely alters the measurement properties of a test. Further examination is needed to understand whether or not the alteration results in better measurement. As the overall psychometric functioning of both versions of the assessment was low, replication is recommended prior to generalizing these results.
Integrating and Evaluating Mathematical Models of Assessing Structural Knowledge: Comparing Associative Network Methodologies
by Emily R. Hoole, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Structural knowledge assessment is a promising area of study for curriculum design and teaching, training, and assessment, but many issues in the field remain unresolved. This study integrates an associative network method, the Power Algorithm from the field of text comprehension, into the realm of structural knowledge assessment by comparing it to an already established associative network method, Pathfinder Analysis. Faculty members selected the fifteen most important concepts in Classical Test Theory. Students and faculty then completed similarity ratings for each concept pair using an online survey program, SurveyMonkey. A variety of similarity measures for the Power Algorithm networks and Pathfinder networks were used to predict course performance in a graduate level measurement class. For the Power Algorithm networks, the correlation between the student and expert links between the concepts in the associative network was computed, along with the congruence coefficient between the associative network links. Finally, a measure of network coherence, harmony, was calculated for each Power Algorithm network. For the Pathfinder networks, the NETSIM measure of similarity between the student and expert networks was computed. An unusual finding for the Pathfinder measure of similarity, NETSIM, was uncovered, in which NETSIM values negatively predicted course performance. Results indicate that the Power Algorithm similarity measures did not uncover a latent structure in the data, but that network harmony might possibly serve as an indicator of quality for knowledge structures. Further investigation of the use of harmony in structural knowledge assessment is recommended.
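To make the congruence coefficient mentioned in the abstract above more concrete, the following is a minimal sketch that assumes the coefficient takes the common Tucker form, the sum of cross-products divided by the square root of the product of the sums of squares. The abstract does not state which formula was used, so that choice, the function name `congruence_coefficient`, and the example link weights are all illustrative assumptions.

```python
from math import sqrt


def congruence_coefficient(student_links, expert_links):
    """Tucker-style congruence between two vectors of link weights.

    Both arguments are equal-length sequences holding the weight of each
    concept-pair link, listed in the same order (an assumed layout).
    """
    if len(student_links) != len(expert_links):
        raise ValueError("link vectors must have the same length")
    cross = sum(s * e for s, e in zip(student_links, expert_links))
    norm = sqrt(sum(s * s for s in student_links) * sum(e * e for e in expert_links))
    if norm == 0:
        raise ValueError("congruence is undefined for an all-zero vector")
    return cross / norm


# Hypothetical link weights for three concept pairs.
print(congruence_coefficient([0.2, 0.5, 0.9], [0.1, 0.6, 0.8]))
```

Unlike the Pearson correlation, this coefficient does not center the vectors first, so it is sensitive to the overall level of the link weights as well as to their pattern.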
by S. Jeanne Horst, Ph.D. (2010)
Advisor: Dr. Sara J. Finney
Despite high-stakes applications of assessment findings, assessment data are frequently collected in situations that are low-stakes for examinees. Because low-stakes tests are of little consequence to the examinees, test-taking motivation and thus the validity of inferences drawn from unmotivated examinees’ scores are of concern. The current study explored examinee self-reported effort in a several-hour-long low-stakes testing context via both structural equation mixture modeling and latent growth modeling approaches. An indirect approach to the structural equation mixture modeling results provided a heuristic for understanding examinee motivation in the low-stakes context. External criteria related to effort, such as goal orientations, self-efficacy for mathematics, and personality variables, contributed to explanations for three classes of examinees: higher-effort, mid-effort, and lower-effort. Expectancy-value theory, personality traits, and fatigue explanations of examinee motivation in a low-stakes context are considered.
Using Verbal Reports to Explore Rater Perceptual Processes in Scoring: An Application to Oral Communication Assessment
by Jilliam N. Joe, Ph.D. (2008)
Advisor: Dr. J. Christine Harmes
Performance assessment has shown increasing promise for meeting educators' needs for "authenticity" in assessment that many argue is missing from standardized multiple choice testing. However, for all of its merits, performance assessment continues to present a formidable challenge to measurement theory and practice when human raters are a component of scoring. Little is known about the cognitive processes raters employ in scoring, particularly in scoring oral communication assessments. The purpose of this study was to explore feature attention within an oral communication assessment scoring context, and how feature attention influenced decisions. An additional purpose was to investigate the utility of verbal reports as a method for collecting perceptual data within an aurally and visually intensive context. The present study employed a concurrent complementarity mixed methods design (Greene, Caracelli, & Graham, 1989), in which concurrent and retrospective verbal report methods were used to gather cognitive data from experienced and inexperienced raters. Specifically, verbal report data were examined to discover meaningful patterns in feature attention, as well as alignment between raters' internal frameworks and the test developer's scoring framework. Generalizability Theory was used to answer questions related to verbal report impact on scoring. Self-report data on perceived difficulty of the scoring task were also collected within each condition of verbal reporting. The findings from this research suggest that raters' internal frameworks as applied in the service of scoring did not align with the test developer's framework. Raters did not consistently attend to the features found in the scoring rubric, nor did they adhere to the scoring system (analytic). Raters demonstrated complex integrative processes that often violated assumptions held about the rating process.
Experienced raters, in particular, engaged in feature attention and subsequent decision-making that often "borrowed" information from other traits to better inform judgments, particularly when the rater endeavored to establish causal relationships for failures in trait mastery. These findings have several implications for rater selection and training procedures, as well as test development in oral communication.
Using the Right Tool for the Job: An Analysis of Item Selection Statistics for Criterion-Referenced Tests
by Andrew T. Jones, Ph.D. (2009)
Advisor: Dr. Christine DeMars
In test development, researchers often depend upon item analysis in order to select items to retain or add to an exam form. The conventional item analysis statistic is the point-biserial correlation. This statistic was developed to select items that would maximize the reliability indices of norm-referenced tests. When the focus of the exam is norm-referenced scores, then the point-biserial correlation works well as an item selection tool. However, the point-biserial correlation is also used in testing contexts where it may be less useful, specifically on criterion-referenced tests. Criterion-referenced tests have different reliability indices than norm-referenced tests, known as decision consistency indices. As such, using the point-biserial correlation to select items to maximize decision consistency may not have as much utility as other options. Researchers have developed several criterion-referenced item analysis statistics that have yet to be fully evaluated for their utility in selecting items for criterion-referenced tests. The purpose of this research was to evaluate each of the respective criterion-referenced item selection tools as well as the point-biserial correlation to determine which one optimized decision consistency.
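For readers unfamiliar with the conventional statistic named in this abstract, the following is a minimal sketch of the point-biserial correlation between a dichotomously scored item and a total test score. The function name and the toy data are illustrative assumptions and are not taken from the dissertation, which also evaluated criterion-referenced alternatives that are not shown here.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and a total score.

    Uses r_pb = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean
    totals of examinees who answered the item correctly and incorrectly,
    s is the population standard deviation of the totals, and p and q are
    the proportions answering correctly and incorrectly.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    right = [t for x, t in zip(item_scores, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
    if not right or not wrong:
        raise ValueError("item needs both correct and incorrect responses")
    s = pstdev(total_scores)
    if s == 0:
        raise ValueError("total scores have no variance")
    p = len(right) / len(item_scores)
    return (mean(right) - mean(wrong)) / s * sqrt(p * (1 - p))


# Hypothetical data: six examinees, one item, and their total scores.
r_pb = point_biserial([1, 1, 1, 0, 0, 0], [9, 8, 7, 4, 3, 2])
```

A high value indicates that examinees who answer the item correctly tend to have higher total scores, which is the sense in which the statistic favors items that spread examinees out rather than items that sharpen a pass/fail decision at a cut score.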
by Cassandra R. Jones, Ph.D. (2009)
Advisor: Dr. Donna Sundre
Recently, more universities have started administering course evaluations online. With the process no longer in the classroom, some students decide not to complete their course evaluations on their own time, resulting in concerns about online course evaluation results being biased because of nonresponse. This study examined course evaluation results at a small, diverse Mid-Atlantic Catholic university. A cross-classified random effects model was used to capture student responses across all of their courses. Nonresponse bias was examined by determining predictors of participation and predictors of online course evaluation ratings. Variables predicting both participation and ratings were considered to be a potential source of nonresponse bias. It was found that gender, ethnicity, and final course grade predicted online course evaluation participation. Only final course grade predicted online course evaluation ratings.
An Empirical Demonstration of Direct and Indirect Mixture Modeling When Studying Personality Traits: A Methodological-Substantive Synergy
by Pamela K. Kaliski, Ph.D. (2009)
Advisor: Dr. Sara Finney
Many personality psychology researchers have employed the person-centered approach of cluster analysis to determine how many categorical Big Five personality types exist. The majority of these researchers have suggested that three Big Five personality types exist; however, results from two recent studies suggested that five types exist. In the first part of the current study, direct mixture modeling (an alternative person-centered approach to cluster analysis) was conducted on Big Five personality variables to explore the number of Big Five personality types that exist in college students, and two methodological approaches for gathering validity evidence for the personality types were demonstrated. Although more validity evidence must be gathered, results of the direct mixture modeling suggested that three personality types may exist in college students; however, the types differed in form from the three types that are commonly reported. In the second part of the current study, the same results were used to demonstrate an application of indirect mixture modeling. As opposed to interpreting the classes as substantively meaningful discrete subgroups, they were interpreted as common configurations that best represent the aggregate dataset. Additionally, the variable-centered approach of multiple regression was conducted. A comparison of the multiple regression results and the indirect mixture modeling results reveals the similarities and differences.
by James R. Koepfler (2012)
Over the past decade, educational policy trends have shifted to a focus on examining students’ growth from kindergarten through twelfth grade (K-12). One way States can track students’ growth is through the use of a vertical scale. Presently, every State that uses a vertical scale bases the scale on a unidimensional IRT model. These models make a strong but implausible assumption that a single construct is measured, in the same way, across grades. Additionally, research has found that variations of psychometric methods within the same model can result in different vertical scales. The purpose of this study was to examine the impact of three IRT models (unidimensional model, U3PL; bifactor model with grade-specific subfactors, BG-M3PL; and a bifactor model with content-specific factors, BC-M3PL), three calibration methods (separate, hybrid, and concurrent), and two scoring methods (EAP pattern scoring and EAP summed scoring, EAPSS) on the resulting vertical scales. Empirical data based on a State’s assessment program were used to create vertical scales for Mathematics and Reading from Grades 3-8. Several important results were found. First, the U3PL model always resulted in the worst model-data fit. The BC-M3PL fit the data best in Mathematics and the BG-M3PL fit the data best in Reading. Second, calibration methods led to minor differences in the resulting vertical scale. Third, examinee proficiency estimates based on the primary factor for each model were generally highly correlated (.97+) across all conditions. Fourth, meaningful classification differences were observed across models, calibration methods, and scoring methods. Overall, I concluded that none of the models were viable for developing operational vertical scales. Multidimensional models are promising for addressing the current limitations of unidimensional models for vertical scaling but more research is needed to identify the correct model specification within and across grades.
Implications for these results are discussed within the context of research, operational practice, and educational policy.
Using Response Time and the Effort-Moderated Model to Investigate the Effects of Rapid Guessing on Estimation of Item and Person Parameters
by Xiaojing Kong, Ph.D. (2007)
Advisor: Dr. Steven Wise
Rapid-guessing behavior, an aberrant examinee behavior observed frequently in testing, creates a possible source of systematic measurement error undermining the psychometric quality of items and tests, and the validity of test scores. The purposes of this dissertation were to examine how and to what extent rapid guessing can impact item parameter and proficiency estimates, and to explore and evaluate the effectiveness of specific psychometric models controlling for rapid guesses. Five interrelated studies were conducted, involving the use of item response times for detecting rapid-guessing behavior in the empirical study, and the employment of the observed distribution of response time effort in the simulation studies. The primary investigation involved comparing the performance of the standard IRT models (i.e., 3PL, 2PL, and 1PL) with that of the effort-moderated item response model (Wise & DeMars, 2006) and its variations (i.e., EM-3PL, EM-2PL, and EM-1PL), with respect to model fit, item parameter estimates, proficiency estimates, and test information and reliability. The performance discrepancies were first studied using data from a computer-based, low-stakes achievement test. The direction and magnitude of estimation bias under each model were further examined under simulated conditions in which the proportion of rapid guesses present in the data varied. Moreover, comparisons between the standard and EM models were conducted for conditions in which the probability of guessing an item right was correlated with examinees' level of proficiency.
Additionally, the influence of rapid guessing on item parameter estimates was examined in the framework of classical test theory.
Results indicate that a small proportion of rapid guesses can bias item indices and examinee proficiency estimates to a notable extent, and that the undesirable influence can be augmented by increased proportions of rapid guesses. The EM models produced more accurate estimates of item parameters, examinee proficiency, and test information than their counterpart IRT models in most simulated conditions. However, exceptions were observed with the two- and one-parameter models. Also, different patterns were found for conditions in which some level of cognitive process was assumed to be involved during a rapid guess.
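The contrast between a standard 3PL model and an effort-moderated variant can be illustrated with a short sketch. It assumes that a response is flagged as a rapid guess when its response time falls below an item-specific threshold, and that a rapid guess succeeds at chance level (one over the number of response options) regardless of proficiency. The function names, the threshold value, and the omission of the 1.7 scaling constant are assumptions made for illustration; the abstract does not give the exact specification used in the dissertation.

```python
from math import exp


def three_pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def effort_moderated_prob(theta, a, b, c, response_time, threshold, n_options):
    """Probability of a correct response under an effort-moderated sketch.

    Responses at or above the time threshold are treated as solution
    behavior and follow the 3PL. Faster responses are treated as rapid
    guesses with chance-level success that does not depend on theta.
    """
    solution_behavior = response_time >= threshold
    if solution_behavior:
        return three_pl(theta, a, b, c)
    return 1.0 / n_options


# Hypothetical item: a = 1.0, b = 0.0, c = 0.2, four options, 5-second threshold.
p_effortful = effort_moderated_prob(0.0, 1.0, 0.0, 0.2, 30.0, 5.0, 4)
p_rapid = effort_moderated_prob(0.0, 1.0, 0.0, 0.2, 1.0, 5.0, 4)
```

Because the rapid-guess branch ignores theta, rapid guesses contribute no information about proficiency under this formulation, which is why treating them as ordinary responses in a standard model can bias both item and person estimates.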
by Abigail R. Lau, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Test-takers can be required to complete a test form, but cannot be forced to demonstrate their knowledge. Even if an authority mandates completion of a test, examinees can still opt to enter responses randomly. When a test has important consequences for individuals, examinees are unlikely to behave this way. However, random responding becomes more likely when the consequences associated with a test are less significant to the examinees. To thwart random responding, test administrators have explored methods to motivate examinees to respond attentively. Ultimately, differences in how examinees approach low-stakes tests are inevitable, and measurement models that account for this difference are needed. This dissertation provides an overview of the approaches that have been proposed for modeling low-stakes test data. Further, it specifically investigates the performance and utility of the mixed-strategies item response model (Mislevy & Verhelst, 1990) as one method of capturing amotivated examinees. Amotivated examinees are defined here as examinees who do not provide meaningful responses to any test items. A simulation study shows that if a normal item response model is used, parameter recovery rates are unacceptable when 9% or more of the examinees were amotivated. However, normal item response models may still be useful if less than 1% of examinees were amotivated. Use of the mixed-strategies item response model led to better parameter estimation than the normal item response model regardless of the proportion of amotivated examinees in the dataset. Additional research is needed to determine if using the mixed-strategies model results in satisfactory parameter recovery when greater than 20% of examinees were amotivated. A second study shows that when the mixed-strategies model was used on real low-stakes test data, the examinees classified as amotivated reported much lower test-taking effort than other examinees. 
However, examinees classified as amotivated were not very different from other examinees in terms of academic ability. This finding supports the notion that the second class in the mixed-strategies model is capturing amotivated examinees rather than low-ability examinees. Limitations of the mixed-strategies modeling technique are discussed, as is the appropriateness of applying this technique in various testing contexts.
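The idea of a second latent class that captures amotivated examinees can be sketched as a two-class mixture: one class responds according to an item response model, and the other responds at chance on every item. The sketch below assumes a Rasch model for the first class, a standard normal proficiency distribution approximated on a grid, and known item difficulties. All names and values are hypothetical, and the dissertation's actual specification may differ.

```python
from math import exp


def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def posterior_amotivated(responses, difficulties, n_options, prior, grid):
    """Posterior probability that a response pattern came from the chance class."""
    chance = 1.0 / n_options
    # Likelihood of the pattern if every response is a random guess.
    like_random = 1.0
    for x in responses:
        like_random *= chance if x == 1 else 1.0 - chance
    # Marginal likelihood under the Rasch class, using normalized
    # standard normal weights on the theta grid.
    weights = [exp(-t * t / 2.0) for t in grid]
    total_weight = sum(weights)
    like_irt = 0.0
    for t, w in zip(grid, weights):
        like_t = 1.0
        for x, b in zip(responses, difficulties):
            p = rasch_p(t, b)
            like_t *= p if x == 1 else 1.0 - p
        like_irt += (w / total_weight) * like_t
    numerator = prior * like_random
    return numerator / (numerator + (1.0 - prior) * like_irt)


# Hypothetical four-item test ordered from easiest to hardest.
difficulties = [-1.5, -0.5, 0.5, 1.5]
grid = [-4.0 + 0.1 * i for i in range(81)]
p_consistent = posterior_amotivated([1, 1, 0, 0], difficulties, 4, 0.1, grid)
p_reversed = posterior_amotivated([0, 0, 1, 1], difficulties, 4, 0.1, grid)
```

A pattern that misses the easy items while answering the hard ones correctly is less plausible under the Rasch class, so it receives a higher posterior probability of belonging to the chance class than a pattern with the same number correct that follows item difficulty.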
Comparing the Relative Measurement Efficiency of Dichotomous and Polytomous Models in Linear and Adaptive Testing Conditions
by Susan Daffinrud Lottridge, Ph.D. (2006)
Advisor: Dr. Christine DeMars
The purpose of this study was to examine the relative performance of the dichotomous and nominal item response theory models in a linear testing and adaptive testing environment. A simulation study was conducted to investigate the relative measurement efficiency when moving from a dichotomous linear test to a dichotomous adaptive test, nominal linear test, and nominal adaptive test. Item exposure was also considered. Two dichotomous models (2PL, 3PL) and two nominal models (Bock's Nominal Model, Thissen's Nominal Model) were used. The simulated data were based upon responses to a 58-item mathematics test by 6,711 students, and Ramsay's nonparametric item response theory method was used to generate option characteristic curves. These curves were then used to generate simulation data. MULTILOG was used to estimate item parameters. An item pool of 522 items was generated from the 58 items, with items being shifted left or right by increments of .05 to create new items. A 30-item fixed-length test was used, as was a 30-item adaptive test. One hundred simulees were generated at each of 47 θ points on [-2.3, +2.3]. Using empirically derived standard errors, results indicated that the adaptive test and polytomous linear test outperformed the dichotomous linear test. The Thissen Nominal Model linear test performed similarly to the 3PL adaptive test, suggesting its potential use in place of the more expensive adaptive test. The Bock Nominal Model linear test also performed better than the 2PL linear test, but not as well as either of the adaptive tests. Future studies are suggested for better understanding the Thissen Nominal Model in light of its performance relative to the 3PL adaptive test.
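The design arithmetic in this abstract can be checked with a brief sketch. It assumes the 47 proficiency points are equally spaced, which implies a spacing of 0.1, and that each of the 58 items yields nine pool items (522 / 58 = 9) through nine offsets from -0.20 to +0.20 in steps of .05. The offsets, the use of a single location value per item, and the function names are assumptions for illustration; the abstract states only the increment size and the pool size.

```python
def theta_grid(low=-2.3, high=2.3, n_points=47):
    """Equally spaced proficiency points, rounded to suppress float noise."""
    step = (high - low) / (n_points - 1)
    return [round(low + i * step, 10) for i in range(n_points)]


def shifted_pool(locations, offsets):
    """One pool item per (original item, offset) pair.

    Each item is reduced to a single location value here, which is a
    simplification of shifting whole option characteristic curves.
    """
    return [round(b + d, 10) for b in locations for d in offsets]


# Assumed offsets: nine values from -0.20 to +0.20 in steps of 0.05.
offsets = [round(0.05 * k, 2) for k in range(-4, 5)]

grid = theta_grid()
pool = shifted_pool([0.0] * 58, offsets)  # placeholder locations
n_simulees = 100 * len(grid)
```

Under these assumptions the grid has 47 points, the pool has 522 items, and the simulation involves 4,700 simulees in total.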
Examining the Bricks and Mortar of Socioeconomic Status: An Empirical Comparison of Measurement Methods
by Ross E. Markle, Ph.D. (2010)
Advisor: Dr. Dena A. Pastor
The impact of socioeconomic status (SES) on educational outcomes has been widely demonstrated in the fields of sociology, psychology, and educational research. Across these fields, however, measurement models of SES vary, including single indicators (parental income, education, and occupation), multiple indicators, hierarchical models, and most often, an SES composite provided by the National Center for Education Statistics. This study first reviewed the impact of SES on outcomes in higher education, followed by the various ways in which SES has been operationalized. In addition, research highlighting measurement issues in SES research was discussed. Next, several methods of measuring SES were used to predict first-year GPA at an institution of higher education. Findings and implications were reviewed with the hope of promoting more careful considerations of SES measurement.
Unfolding Analyses of the Academic Motivation Scale: A Different Approach to Evaluating Scale Validity and Self-Determination Theory
by Betty Jo Miller, Ph.D. (2007)
Advisors: Dr. Donna Sundre and Dr. Christine DeMars
Using the framework of a strong program of construct validation (Benson, 1998), the current study investigated Self-Determination Theory (SDT; Deci & Ryan, 1985), the construct of academic motivation, and the Academic Motivation Scale (AMS; Vallerand et al., 1992). Building upon a body of prior research that provided only limited support for the theory and the seven-factor structure of the scale, a technique other than factor analysis was used to analyze responses to the AMS. Specifically, the utility of a unidimensional unfolding model in analyzing such responses was explored. In addition, scale development efforts were pursued, and multiple measures of academic motivation within a single sample of students were compared. Data were collected from three large samples of university students over the period of one year. The AMS and other instruments were self-report measures administered on computer and by paper-and-pencil. Qualitative data were collected from the second sample for the purposes of exploring new content for pilot items and for explaining certain results. Results have important implications for both SDT and the measurement of academic motivation using the AMS. A unidimensional unfolding model was shown to provide adequate fit to the data, supporting the argument that academic motivation is a single construct ordered along a continuum according to increasingly internal degrees of self-regulation. Using the estimated item locations, a shortened version of the AMS was proposed that was highly reliable and consistent with SDT. Finally, a comparison of unfolded motivation scores with summated AMS subscale scores revealed the folding of the response process.
by Christopher Orem (2012)
Meta-assessment, or the assessment of assessment, can provide meaningful information about the trustworthiness of an academic program's assessment results (Bresciani, Gardner, & Hickmott, 2009; Palomba & Banta, 1999; Suskie, 2009). Many institutions conduct meta-assessments for their academic programs (Fulcher, Swain, & Orem, 2012), but no research exists to validate the uses of these processes' results. This study developed the validity argument for the uses of a meta-assessment instrument at one mid-sized university in the mid-Atlantic. The meta-assessment instrument is a fourteen-element rubric that aligns with a general outcomes assessment model. Trained raters apply the rubric to annual assessment reports that are submitted by all academic programs at the institution. Based on these ratings, feedback is provided to programs about the effectiveness of their assessment processes. Prior research had used Generalizability theory to derive the dependability of the ratings provided by graduate students with advanced training in assessment and measurement techniques. This research focused on the dependability of the ratings provided to programs by faculty raters. In order to extend the generalizability of the meta-assessment ratings, a new fully-crossed G-study was conducted with eight faculty raters to compare the dependability of their ratings to those of the previous graduate student study. Results showed that the relative and absolute dependability of two-rater teams of faculty (ρ² = .90, Φ = .88) were comparable to the dependability estimates of two-rater teams of graduate students. Faculty raters were more imprecise than graduate students in their ratings of individual elements, but not substantially. Based on the results, the generalizability of the meta-assessment ratings was expanded to a larger universe of raters. Rater inconsistencies for elements highlighted potential weaknesses in rater trainings.
Additional evidence should be gathered to support several assumptions of the validity argument. The current research provides a roadmap for stakeholders to conduct meta-assessments and outlines the importance of validating meta-assessment uses at the program, institutional, and national levels.
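The relative and absolute coefficients reported in this abstract can be illustrated for the simplest case, a single-facet design in which every report is rated by every rater. The sketch below assumes that simplified design and uses made-up variance components; the study's actual design and estimates are not given in the abstract, and the function and argument names are hypothetical.

```python
def g_coefficients(var_object, var_rater, var_residual, n_raters):
    """Relative (rho squared) and absolute (Phi) coefficients, crossed design.

    var_object   : variance among the objects of measurement (e.g., reports)
    var_rater    : variance due to rater severity or leniency
    var_residual : object-by-rater interaction confounded with error
    n_raters     : number of raters averaged over in the decision study
    """
    relative_error = var_residual / n_raters
    absolute_error = (var_rater + var_residual) / n_raters
    rho_squared = var_object / (var_object + relative_error)
    phi = var_object / (var_object + absolute_error)
    return rho_squared, phi


# Made-up variance components for a two-rater team.
rho_squared, phi = g_coefficients(0.50, 0.02, 0.10, n_raters=2)
```

Phi can never exceed the relative coefficient because its error term additionally includes the rater main effect, and both coefficients rise as more raters are averaged.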
by Suzanne L. Pieper, Psy.D. (2003)
Advisors: Dr. Donna L. Sundre and Dr. Sara J. Finney
This study, which refined and extended the 2 x 2 achievement goal framework of mastery-approach, mastery-avoidance, performance-approach, and performance-avoidance goals, had three purposes: (1) to investigate the possibility of a fifth goal orientation: work avoidance, (2) to examine the functioning of new items written to better measure the four goal orientations, and (3) to gather validity evidence for the four goal orientations and possibly a fifth goal orientation by examining the association between the variables need for achievement and fear of failure and the goal orientations. First, the results of this study provided support for the four-factor model of achievement goal orientation using the 12-item Achievement Goal Questionnaire (AGQ) (Elliot & McGregor, 2001) modified for a general academic domain. The four-factor model provided a good fit to the data and a better fit than competing models. Second, the results of this study provided support for the improved reliability and validity of the 16-item AGQ with one item added to each goal orientation subscale to improve measurement. Third, the results of this study provided strong evidence for the existence of a fifth goal orientation: work-avoidance. The five-factor model of goal orientation--mastery-approach, mastery-avoidance, performance-approach, performance-avoidance, and work-avoidance--as measured by the 20-item AGQ provided a good fit to the data. Furthermore, the work-avoidance orientation demonstrated relationships with the criterion variables workmastery, competitiveness, and fear of failure that were expected based on previous theory and research. While this study answers the call of Maehr (2001) to reinvigorate goal theory by considering many possible ways students engage in learning, much still needs to be done in terms of defining and assessing the work-avoidance goal orientation. Additionally, the limitations of this study need to be addressed.
The results of this study need to be validated with other student populations and in a variety of educational contexts. Finally, because the same sample of college students was used for all three analytical stages of this study, thereby increasing the possibility of Type I error, future studies need to validate these results with fresh samples.
by Shelley Ragland, Ph.D. (2010)
Advisor: Dr. Christine E. DeMars
In order to fairly compare scores derived from different forms of the same test within the Item Response Theory framework, all individual item parameters must be on the same scale. A new approach, the RPA method, which is based on transformations of predicted score distributions, was evaluated here and was shown to produce results comparable to the widely used Stocking-Lord (SL) method under varying conditions of test length, number of common items, and differing ability distributions in a simulation study. The new method was also examined using actual student data and a resampling analysis. Both the simulation study and actual student data study resulted in very similar transformation constants for the RPA and SL methods when 15 or 10 common items were used. However, compared to the SL method, the RPA method produced greater variance, especially when only 5 common items were used in the actual student data analysis. The simulated and actual data research findings demonstrate that the RPA method is a viable option for producing the transformation constants necessary for transforming separately calibrated item parameter estimates prior to equating.
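The role of the transformation constants can be sketched for the Stocking-Lord comparison method only; the RPA method is not described in enough detail in the abstract to illustrate. The sketch assumes 3PL common items, a slope constant A and an intercept constant B that map new-form parameters onto the base scale (a becomes a / A, b becomes A * b + B), and a criterion equal to the summed squared difference between test characteristic curves over a grid of proficiency values. Function names and item values are hypothetical, and no optimizer is included.

```python
from math import exp


def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def rescale(items, A, B):
    """Place (a, b, c) item parameters on the base scale using A and B."""
    return [(a / A, A * b + B, c) for a, b, c in items]


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def stocking_lord_loss(A, B, base_items, new_items, thetas):
    """Squared TCC difference between base items and rescaled new items."""
    moved = rescale(new_items, A, B)
    return sum((tcc(t, base_items) - tcc(t, moved)) ** 2 for t in thetas)


# Hypothetical common items on the base scale, and the same items expressed
# on a new scale that differs by known constants A = 1.2 and B = 0.5.
base = [(1.0, -0.5, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.25)]
new = [(a * 1.2, (b - 0.5) / 1.2, c) for a, b, c in base]
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
loss_at_truth = stocking_lord_loss(1.2, 0.5, base, new, thetas)
loss_untransformed = stocking_lord_loss(1.0, 0.0, base, new, thetas)
```

In practice the constants are found by minimizing this criterion numerically; here the loss is essentially zero at the constants used to generate the new-scale parameters and clearly larger when no transformation is applied.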
Comparability of Paper-and-Pencil and Computer-Based Cognitive and Non-Cognitive Measures in a Low-Stakes Testing Environment
by Barbara E. Rowan, Ph.D. (2010)
Advisors: Dr. Joshua T. Goodman and Dr. J. Christine Harmes
Computerized versions of paper-and-pencil tests (PPT) have emerged over the past few decades, and some practitioners are using both formats concurrently. But computerizing a PPT may not yield equivalent scores across the two administration modes. Comparability studies are required to determine if the scores are equivalent before treating them as such. These studies ensure fairer testing and more valid interpretations, regardless of the administration mode used. The purpose of this study was to examine whether scores from paper-based and computer-based versions of a cognitive and a non-cognitive measure were equivalent and could be used interchangeably. Previous research on test score comparability used simple methodology that provided insufficient evidence for the score equivalence. This study, however, demonstrated a set of methodological best practices, providing a more complex and accurate analysis of the degree of measurement invariance that exists across groups. The computer-based test (CBT) and PPT contained identical content and varied only in administration mode. Participants took the tests in only one format, and the administration was under low-stakes conditions. Confirmatory factor analyses were conducted to confirm the established factor structure for both the cognitive and the non-cognitive measures, and reliability and mean differences were checked for each subscale. The scalar, metric, and configural invariance were tested across groups for both measures. Because of the potential impact on measurement invariance, differential item functioning (DIF) was tested and those items were removed from the data set; measurement invariance across test modes was again evaluated.
Results indicate that both the cognitive and the non-cognitive measures were metric invariant (essentially tau-equivalent) across groups, and the DIF items did not impact the degree of measurement invariance found for the cognitive measure. Therefore, the same construct was measured to the same degree, but scores are not equivalent without rescaling. Measurement invariance is a localized issue; thus, comparability must be established for each instrument. Practitioners cannot assume that the scores obtained from the PPT and CBT will be equivalent. How these test scores are used will determine what changes must be made with tests that have less than strict measurement invariance.
Development and Validation of the Preservice Mathematical Knowledge for Teaching Items (PMKT): A Mixed-Methods Approach
by Javarro Russell (2011)
Advisor: Dr. Robin D. Anderson
Mathematical knowledge for teaching (MKT) is the knowledge required for teaching mathematics to learners. Researchers suggest that this construct consists of multiple knowledge domains. Those domains include teachers’ knowledge of mathematical content and knowledge about teaching mathematics. These domains of MKT have been theoretically and empirically examined to determine their effects on K-12 student achievement. However, empirical evidence of this relationship is limited due to a lack of measures to assess MKT.
Recently, researchers have constructed measures of MKT to evaluate the effectiveness of professional development activities with in-service teachers. These measures, however, lack validity evidence for use in teacher education program assessment. Program assessment allows programs to determine the effectiveness of their curriculum in assisting preservice teachers in meeting learning outcomes. This process requires adequate tools for assessing the extent to which students meet the learning outcomes. In a teacher education program, some of those learning outcomes are related to MKT. To assess these outcomes, teacher educators need measures of MKT that relate to their learning objectives. Previous research has not supported the use of any current measure for assessing MKT in a teacher education program.
To address this gap in the literature, a process of construct validation was conducted for items developed to assess MKT at the program level of a teacher education program. Validation evidence for the items was obtained by using Benson’s framework of a strong program of construct validation. The factor structure of the items was analyzed and expected group differences were assessed. Qualitative data from cognitive interviews were then used to provide convergent evidence in regards to the construct validity of the items. The overall purpose of these methods of inquiry was to develop items that would reflect the MKT that resulted from a teacher education mathematics curriculum.
Results from factor analyses indicated that the 23 PMKT items could plausibly be composed into an 11-item essentially unidimensional scale of specialized content knowledge. The factor underlying responses to the 11-item scale appeared to be related to a specified learning objective of the program. This learning objective suggests that graduating preservice teachers should be able to evaluate a K-8 student’s mathematical work or arguments to determine if the ideas presented are valid. Interviews with participants revealed themes indicating that the items were measuring this aspect of specialized content knowledge. Comparisons among students at differing levels of the mathematics education curriculum revealed significant but small differences between upper level preservice teachers and preservice teachers who received no instruction. Further analysis of these items indicated that they could be improved by focusing future item development on examining misconceptions in evaluating mathematical arguments. These findings have several implications for teacher education program assessment, as well as item development for measuring MKT.
by Kelly A. Williams Scocos, Psy.D. (2002)
Advisor: Dr. Steven L. Wise
The goal of this dissertation was to investigate the viability of the most broadly accepted definition of critical thinking. This definition is the Delphi model (Facione, 1990) and it receives support from professionals both in education and in business. A single, multipart instrument, the Williams Critical Thinking Assessment, was developed to measure the individual facets of critical thinking delineated by the Delphi conceptualization. Results indicated that the Delphi model constituted a workable critical thinking definition. Furthermore, critical thinking defined in a manner consistent with the Delphi model was demonstrated to be distinct from scholastic achievement. Educationally, these discoveries have implications for both critical thinking instruction and learning in a collegiate environment.
by J. Carl Setzer, Ph.D. (2008)
Advisor: Dr. Dena Pastor
Recently, there have been two types of model formulations used to demonstrate the utility of explanatory item response models. Specifically, the generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM) have expanded item response models to include covariates for item effects, person effects, or both simultaneously. Both frameworks have recently been garnering greater attention in the educational measurement field. Despite these two frameworks being conceptually equivalent, much of the related literature has emphasized one or the other. However, to date, there has been little attempt to associate the frameworks together. In addition, item response models that have been described within the GLMM and HGLM frameworks have mostly been of the unidimensional type. Very little has been done to demonstrate the utility of an explanatory multidimensional item response model. As explanatory models become more prevalent in research and practice, it is important to maintain software that can estimate them. SAS is an all-purpose and widely-used program that can estimate explanatory item response models. However, no previous research has examined how well SAS can recover the parameters of an explanatory multidimensional Rasch model (EMRM). There were three main goals of this study. First, several types of Rasch models, including both non-explanatory and explanatory models, were summarized within the GLMM and HGLM frameworks. The equivalence of these two frameworks was demonstrated for each model. Second, a parameter recovery study was performed to determine how well SAS PROC NLMIXED can recover the parameters of an EMRM. The effect of sample size and test length on parameter recovery was assessed. The results of the simulation study indicate that very little bias occurs, even with small sample sizes and short test lengths. The final goal was to demonstrate the utility of an EMRM model using empirical data. 
Using data collected from the Marlowe-Crowne Social Desirability Scale (MCSDS), an EMRM was fit to the data while using gender as a covariate. Interpretations of the model parameter estimates were given and it was concluded that gender did not explain a significant amount of variation in either of the MCSDS subscales.
Cyberspace Versus Face-to-Face: The Influence of Learning Strategies, Self-Regulation, and Achievement Goal Orientation
by Kara Owens Siegert, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Web-based education (WBE) is a popular educational format that allows certain learning and teaching advantages. However, some students may not learn or perform as well in this environment as compared to traditional face-to-face education (F2FE) settings. Little research has examined the differential impact of learner characteristics on performance in these two environments. This study explored differences in learning strategies, self-regulation skills, and achievement goal orientation, in WBE and F2FE college classrooms and found that students in the two environments could be differentiated based on the composite of learner characteristics. Specifically, WBE and F2FE students differed in terms of self-regulation, elaboration, and mastery-avoidance goals. Learner characteristics, however, did not have a differential influence on college student performance in the two environments.
Should We Worry About the Way We Measure Worry Over Time? A Longitudinal Analysis of Student Worry During the First Two Years of College
by Peter J. Swerdzewski, Ph.D. (2008)
Advisor: Dr. Sara Finney
This study evaluated longitudinal change in student worry using the Student Worry Questionnaire-30 (SWQ-30), an instrument that represents worry as six separate factors: (1) Worrisome Thinking, (2) Financial-Related Concerns, (3) Significant Others' Well-Being, (4) Academic Concerns, (5) Social Adequacy Concerns, and (6) Generalized Anxiety Symptoms. Prior to evaluating longitudinal change, the factor structure of the SWQ-30 was examined using four cross-sectional independent samples. A best-fitting six-factor model was found that removed four redundant items from the original 30-item instrument. This six-factor 26-item model was then fit to data from a longitudinal sample of students who completed the measure as entering freshmen and second-semester sophomores. Evidence for full configural and metric invariance was found. When the data were tested for scalar invariance, one item from each of the following subscales was found to be scalar non-invariant: Worrisome Thinking, Social Adequacy Concerns, and Financial-Related Concerns. Additionally, most of the items from the Generalized Anxiety Symptoms factor were found to be scalar non-invariant, thus making the latent mean difference for the factor uninterpretable. Overall, interpretable latent mean differences and stability estimates provided evidence that student worry was stable over time, although students appeared to decrease in the degree to which they worried about social adequacy. These findings suggest that some aspects of worry and the infamous sophomore slump may be unrelated phenomena. In sum, the SWQ-30 is a promising measure of multidimensional student worry; however, it has not received adequate empirical study. Furthermore, given the dearth of empirical research examining the stability of student worry over time and the unique characteristics of the samples under study, future research must be conducted to better uncover the link between worry and the sophomore slump.
An Application of Generalizability Theory to Evaluate the Technical Quality of An Alternate Assessment
by Melinda A. Taylor, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Federal regulations require testing of students with the most severe cognitive disabilities, although little guidance has been given regarding the format of such assessments or how technical quality should be documented. It is well documented that specific challenges exist with the documentation of technical quality for alternate assessments, which are often less standardized than their general assessment complements. One of the first steps in documenting technical quality is to determine the reliability of scores resulting from an assessment. Typical measures of reliability under a classical test theory framework, such as coefficient alpha, do little to model the multiple sources of error that are characteristic of alternate assessments. In contrast, generalizability theory (G-theory) allows researchers to identify potential sources of variability in scores and to analyze the relative contribution of each of those modeled sources. The purpose of this study was to demonstrate an application of G-theory to examining the technical quality of scores from an alternate assessment. A G-study in which rater type, assessment attempts, and tasks were identified as facets was conducted to determine the relative contribution of each facet to observed score variance. Data resulting from the G-study were used to examine the reliability of scores using a criterion-referenced interpretation of error variance associated with scores. The current assessment design was then modified to examine how changes in the design might impact the reliability of scores. Based on established criteria, the proposed designs were evaluated in terms of their ability to yield acceptable reliability coefficients. As a final step in the analysis, designs that were deemed satisfactory were evaluated from a practical standpoint with respect to the feasibility of adapting them into a statewide standardized assessment program used for student and school accountability purposes.
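The logic of evaluating alternative designs can be illustrated with a minimal sketch of a decision study. For brevity it assumes a fully crossed persons x raters x tasks design with only two facets, whereas the study itself modeled three (rater type, attempts, and tasks). The variance components and function names are hypothetical and are not taken from the dissertation. The index of dependability (Phi) is the coefficient usually paired with a criterion-referenced interpretation, because it counts every non-person variance component as error.

```python
def absolute_error_variance(vc, n_raters, n_tasks):
    """Absolute error variance for a crossed p x r x t design.

    vc maps component names to estimated variance components:
    'r', 't', 'pr', 'pt', 'rt', and 'prt_e' (residual).
    Each component is divided by the number of conditions averaged over.
    """
    return (
        vc["r"] / n_raters
        + vc["t"] / n_tasks
        + vc["pr"] / n_raters
        + vc["pt"] / n_tasks
        + vc["rt"] / (n_raters * n_tasks)
        + vc["prt_e"] / (n_raters * n_tasks)
    )


def dependability(vc, n_raters, n_tasks):
    """Index of dependability (Phi): person variance over person plus absolute error."""
    error = absolute_error_variance(vc, n_raters, n_tasks)
    return vc["p"] / (vc["p"] + error)


# Hypothetical variance components, for illustration only.
components = {
    "p": 0.50, "r": 0.02, "t": 0.10,
    "pr": 0.05, "pt": 0.20, "rt": 0.01, "prt_e": 0.30,
}

# Compare candidate designs that differ in the number of raters and tasks.
for n_r, n_t in [(1, 1), (2, 3), (2, 6)]:
    phi = dependability(components, n_r, n_t)
    print(f"raters={n_r}, tasks={n_t}, Phi={phi:.3f}")
```

Increasing the number of conditions in a facet shrinks only the error terms that involve that facet, which is why a design change can be targeted at the largest source of error.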
by Amy DiMarco Thelk, Ph.D. (2006)
Advisor: Dr. Donna L. Sundre
Published literature reveals little information about whether examinees should be told of established performance expectations prior to test taking. This study investigated whether students who are told of a test's cut scores, information about student performance from previous test administrations, or both types of information have significantly different test performance or motivation scores than those receiving only the standardized instructions. This research was conducted at a community college during regular assessment testing. Students taking a quantitative and scientific reasoning exam (QRSR) were assigned to one of four testing conditions. Motivation information was collected via two measures: Response Time Effort (RTE; Wise & Kong, 2005) and the Student Opinion Scale (SOS; Sundre, 1999). A confirmatory factor analysis was conducted to determine whether the two-factor structure of the SOS held up when administered to a community-college sample. The results support the established structure when administered in this setting. The second phase of analysis involved testing three path models to assess the impact of (a) SOS; (b) RTE; and (c) SOS and RTE on test scores. While the treatments had only small, and contradictory, effects on SOS and RTE, all three models were significant. SOS accounted for 9% of test score variance, RTE alone accounted for 16% of the variance in test scores, and the combination of RTE and SOS accounted for 19% of the variance in test scores. The final phase of the project involved interviewing a sample of students (n=8) following testing. Interviewees were asked about treatment recognition, effort, and ideas about motivating students in testing situations. While students were able to recognize the written information they had seen prior to testing, only one freely recalled seeing additional data prior to testing. These findings call the potency of the manipulations into question.
Also, while students verbally reported variations in how hard they tried, scores on the Effort subscale were not significantly different. The results of this study do not offer strong guidance on whether to tell students about cut scores prior to testing. Limitations of the research and suggestions for future research are offered.
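As background on the RTE measure, the index is generally described as the proportion of items on which an examinee's response time meets or exceeds an item-specific threshold separating rapid guessing from solution behavior. The sketch below illustrates that idea only. The function name, the thresholds, and the response times are hypothetical, and the thresholds used in the study are not given in the abstract.

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items answered with solution behavior.

    response_times: seconds the examinee spent on each item.
    thresholds: per-item cutoffs in seconds; a response at or above the
    cutoff is treated as solution behavior, below it as a rapid guess.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("response_times and thresholds must have the same length")
    if not response_times:
        raise ValueError("at least one item is required")
    solution_behavior = [rt >= th for rt, th in zip(response_times, thresholds)]
    return sum(solution_behavior) / len(solution_behavior)


# Hypothetical examinee: two of four responses fall below a 5-second cutoff.
rte = response_time_effort([12.0, 1.5, 30.0, 2.0], [5.0, 5.0, 5.0, 5.0])
print(rte)
```

Because the index is computed from logged response times rather than self-report, it can diverge from questionnaire-based measures such as the SOS.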
by John Taylor Willse, Psy.D. (2002)
Advisor: Dr. Christine DeMars
Computer adaptive tests (CATs) have a tendency to capitalize on chance errors in a-parameter estimates (van der Linden and Glas, 2000). A-stratified, match difficulty, separate item-selection/item-scoring (half), and 1-pl only CATs were compared to a maximum information CAT for their ability to address the negative effects associated with controlling capitalization on chance. The CATs were evaluated in three simulations (i.e., using 1-, 2-, and 3-pl true item response theory models). Results were presented in terms of prevention of capitalization on chance and overall effectiveness. The phenomenon of capitalization on chance by a maximum information CAT was replicated. The a-stratified, match difficulty, and half CATs were successful at preventing capitalization on chance. Through consideration of overall effectiveness and ease of implementation, the match difficulty CAT was determined to be the best alternative to the maximum information CAT. The 1-pl only CAT was shown to be a poor alternative, especially in the 3-pl true item simulation.
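Why a maximum information CAT is prone to capitalizing on chance can be seen from the item information function, which grows with the square of the a-parameter. The following is a minimal sketch of maximum information item selection under the 3-pl model (logistic metric, no scaling constant). The item pool, the parameter values, and the function names are hypothetical and are not drawn from the dissertation.

```python
import math


def p_correct(theta, a, b, c=0.0):
    """3-pl probability of a correct response; c=0 gives the 2-pl model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Fisher information of a 3-pl item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_max_information(theta, items, administered):
    """Index of the unused item with the most information at theta.

    items: list of (a, b, c) tuples holding *estimated* parameters.
    administered: set of indices already given to the examinee.
    """
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical pool of (a, b, c) estimates.
pool = [(0.8, 0.0, 0.0), (1.6, 0.3, 0.0), (1.0, -1.0, 0.0)]
first = select_max_information(0.0, pool, administered=set())
second = select_max_information(0.0, pool, administered={first})
print(first, second)
```

Because selection is driven by the estimated a-parameters, items whose a-values happen to be overestimated are chosen more often than their true information warrants, which is the problem the alternative selection methods are meant to control.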