Doctoral Dissertations from the Assessment & Measurement Program
by Robin D. Anderson, PsyD (2001)
Advisor: Dr. Donna Sundre
The purpose of this study was to examine whether one of the most common standardized testing procedures, the collection of demographic information prior to testing, facilitates performance decrements in subjects for whom a negative domain performance stereotype exists. The primary investigation involved examining whether the presence of a gender identification section on an optically readable form and the request that the gender section of the form be completed was a priming stimulus sufficient to trigger a stereotype threat effect. This study provided a real-world adaptation of previous stereotype threat research. Results indicate that the inclusion of a gender identification item is not a sufficient priming stimulus to trigger stereotype threat patterns in low-stakes assessments. Results do indicate, however, that the removal of such an item may increase motivation and performance for both negatively and positively stereotyped groups.
by Susan K. Barnes, Ph.D. (2010)
Advisor: Dr. J. Christine Harmes
In this era of increased accountability in education, there is an urgent need for tools to use in assessing the abilities and instructional levels of young children. Computers have been used successfully to assess the abilities and achievements of older children and adults. However, there is a dearth of empirical research to provide evidence that computer-based testing (CBT) is appropriate for use with typically developing children under the age of six.
The purpose of this study was to explore the feasibility of using CBT with children in preschool and kindergarten. Children were administered paper-and-pencil (PPT) and CBT versions of the rhyme awareness subscale of the Phonological Awareness Literacy Screening (Preschool). After completing each assessment, each child shared individual reactions by selecting a card illustrating an emotion (e.g., joyful, happy, bored, sad, angry) and participating in a brief interview. Parents and teachers completed short questionnaires describing each child’s previous computer experience, fine motor skills, and ability to recognize and generate rhymes.
An embedded mixed methods design was used to explore (a) to what extent children could complete the CBT independently, (b) how children reacted to the tests, and (c) how the results from the CBT and the PPT compared. Interview transcripts and field notes were used to more fully explain the test results. Findings indicated that preschool and kindergarten children needed help with the CBT. Difficulties were related to using the mouse and following directions. About 12% of the kindergarteners needed adult support to finish the CBT, compared to nearly half of the preschoolers. Children of all ages reported enjoying using the computer and doing the rhyming tasks; however, many preschoolers appeared anxious to leave the testing area or tried to discuss topics unrelated to the assessment. For preschoolers, there was a test administration mode effect; the CBT was more difficult than the PPT. These results have implications for test development and use. CBTs for preschoolers must be designed to meet their physical and cognitive developmental needs. Also, preschool children need adequate practice using computer hardware and software before they can reliably demonstrate their skills and abilities through CBT.
Examining Change in Motivation Across the Course of a Low-Stakes Testing Session: An Application of Latent Growth Modeling
by Carol L. Barry, Ph.D. (2010)
Advisor: Dr. Sara J. Finney
As the emphasis on accountability in education increases, so does the prevalence of low-stakes testing. It is essential to understand test-taking motivation in low-stakes contexts, as low motivation has implications for the validity of inferences made from test scores about examinee knowledge and ability. The current study expanded upon previous work by exploring the existence of types of test-takers characterized by qualitatively different patterns of test-taking effort across the course of a three-hour low-stakes testing session. Mixture modeling results did not support the existence of types of test-takers for this sample of upperclass examinees. Latent growth modeling results indicated that change in effort across the testing session was well-represented by a piecewise growth form, wherein effort increased from the first to fourth test and then decreased from the fourth to fifth test. Further, there was significant variability in effort for each test as well as in rates of change in effort. The inclusion of external predictor variables indicated that whether an examinee attended the regular testing session versus a makeup session, mastery approach goal orientation, conscientiousness, and agreeableness partly accounted for variability in effort for the various tests, whereas only agreeableness was related to rates of change in effort. Additionally, the degree to which examinees viewed a particular test as important was weakly to moderately related to effort for a difficult, cognitive test but not for less difficult, noncognitive tests. Further, change in test-taking effort was not related to change in perceived test importance. These results have important implications both for assessment practice and the use of motivation theories to understand test-taking motivation.
by Anna Katherine Busby, Ph.D. (2005)
Advisor: Dr. Christine DeMars
This study provides validity evidence for the use of the Leadership Attitudes and Beliefs Scale III (LABS III; Wielkiewicz, 2000) scores. The scale is based upon the ecology theory of leadership (Allen, Stelzner, & Wielkiewicz, 1998), and is designed to measure the attitudes and beliefs college students have toward leadership. This study was conducted with 845 college students at a large, midwestern, urban institution. The content of the LABS III items was examined to determine the relationship between the ecology theory of leadership and the scale. The items did not completely represent the ecology theory. A confirmatory factor analysis (CFA) was conducted to test the hypothesized two-factor model, and the data did not fit the hypothesized model well. The scale was modified using theoretically-supported model modifications and additional research questions were explored. The modified LABS III scores were correlated with scores from the Miville-Guzman Universality-Diversity Scale-Short Form (Fuertes, Miville, Mohr, Sedlacek, & Gretchen, 2000). A moderate correlation was found and this result supported the hypothesis that there is a relationship between attitudes toward diversity and attitudes toward leadership. The modified LABS III scores were also correlated with the subscale scores of the Student Leadership Practices Inventory (Posner & Brodsky, 1992). Moderate correlations were found and this result supports the hypothesis that leadership attitudes are related to leadership practices. It was hypothesized that age would be strongly correlated with leadership attitudes; however, the results did not support this hypothesis. The results also supported the hypothesis that men and women differ in their attitudes toward leadership. Further examination of the ecology theory of leadership in relation to the LABS III and the LABS III factor structure is recommended. The results from this study suggest that a number of theory-based hypotheses were supported.
However, continued refinement of the theory and its relationship to the scale needs to be explicated. Only through continued reflection and careful study can the nomological net of the ecology theory of leadership be developed and contribute to research in leadership.
Invariance of the Modified Achievement Goal Questionnaire Across College Students with and without Disabilities
by Hilary Lynne Campbell, Ph.D. (2007)
Advisor: Dr. Dena Pastor
As an increasing number of students with disabilities (SWDs) is taking part in postsecondary education, postsecondary institutions must meet the needs of this unique population. Because it is linked to important achievement-related outcomes, one area in which educators have historically tried to meet students' needs is achievement goal orientation (AGO). Educators must ensure that they are able to measure AGO for SWDs and to determine whether SWDs would benefit from different services or educational methods than their nondisabled peers. In the K-12 literature, studies suggest that SWDs may have different AGO profiles than their peers, but no such research has been conducted for college students.
One specific instrument designed to measure AGO, the modified Achievement Goal Questionnaire (AGQ-M; Finney, Pieper, & Barron, 2004), was administered to college students with and without disabilities. Confirmatory factor analyses were conducted with both populations to test the four-factor structure of AGO (Mastery-Approach, Mastery-Avoidance, Performance-Approach, Performance-Avoidance). Next, a series of tests was conducted to test the measurement and structural invariance of the AGQ-M across students with and without disabilities. Finally, latent means for the two samples on each dimension of AGO were compared.
The four-factor model of AGO fit both samples well. Further, invariance of factor loadings (metric invariance), intercepts (scalar invariance), error variances, factor variances, and factor covariances was supported. Since the AGQ-M was found to be invariant, latent means were compared. In contrast to previous findings in the literature, results indicated no significant or practically meaningful differences between these two groups on any of the four dimensions of the AGQ-M. These results suggest that college students with and without disabilities may not have markedly different AGO profiles.
Results may differ from previous findings because the sample of SWDs in this study had already completed several semesters of college at a moderately selective institution; these students likely differed in important ways from the general population of SWDs. This study lays the groundwork for a host of future studies, including replication studies, studies involving specific disability groups, and studies linking AGO profiles to external achievement-related variables for college students with and without disabilities.
Using explanatory item response models to examine the impact of linguistic features of a reading comprehension test on English language learners
by Jaime A. Cid, Ph.D. (2009)
Advisors: Dr. Dena Pastor and Dr. Joshua Goodman
The unintended consequences of high-stakes testing decisions made on scores that may vary as a function of language proficiency have been noted as a major threat to English language learners (ELLs) (Herman & Abedi, 2004; Mahoney, 2008). While several studies have focused on the effects of language proficiency in high-stakes science and math examinations, the impact of English language proficiency on reading comprehension tests has received far less attention. Furthermore, the effects that specific linguistic features of reading comprehension tasks have on ELLs' test performance have been noticeably understudied. The overall aim of this study was to examine the impact of seven linguistic features (false cognates, homographs, negative wording, propositional density, surface structure, syntactic complexity, and vocabulary) of a high-stakes reading comprehension test on Spanish-speaking ELLs using explanatory item response models conceptualized as Hierarchical Generalized Linear Models (HGLMs). More specifically, in a 40-item reading test, explanatory item response models were used to investigate: (a) differential item functioning (DIF) for ELLs and non-ELLs in a traditional manner; (b) whether items consisting of certain linguistic features were differentially difficult; (c) the extent to which linguistic features may be differentially difficult for ELLs in comparison to non-ELLs; and (d) whether the difficulty of the items with such linguistic features varied across ELLs with different years of formal exposure to Spanish as primary language of academic instruction. The results of investigating DIF in a traditional manner revealed that six items (four favoring non-ELLs and two favoring ELLs) displayed DIF with group differences of at least half a logit. The estimates of the effects of the seven linguistic features were statistically significant (p < 0.0001).
However, only false cognates, negative wording, surface structure, and vocabulary increased the difficulty of an item. The differential functioning of the seven linguistic features revealed that the log-odds of getting a typical item right were 0.4867 logits lower for ELLs compared to non-ELLs. However, from a practical significance perspective, the linguistic features were not differentially difficult for the two groups. While the results of the linguistic feature combinations showed that the majority of the features displayed differential difficulty in favor of non-ELLs, none of them can be considered of practical significance. Finally, items with only false cognates were less difficult for ELLs with more years of exposure to Spanish as primary language of academic instruction. The benefits of the explanatory properties of English language status as a person-level predictor in a reading comprehension test along with practical implications of the current research and directions for future research are discussed.
by Chris M. Coleman (2013)
Advisor: Dr. Deborah L. Bandalos
Researchers often collect data on attitudes using “balanced” measurement scales—that is, scales with comparable numbers of positive and negative (i.e., reverse-scored) items. Many previous measurement studies have found the inclusion of negative items to be detrimental to scale reliability and validity. However, these studies have rarely distinguished among negatively-worded items, negatively-keyed items, and items with negative wording and keying. The purpose of the current study was to make those distinctions and investigate why the psychometric properties of balanced scales tend to be worse than those of scales with uniformly positive wording/keying.
A mixed-methods approach was employed. In Study 1 (quantitative), item wording and keying were systematically varied in adaptations of two published attitude measures that were administered to a large college student sample. Reliability and dimensionality of the resulting data were examined across the measures in each of four wording/keying configurations. Study 2 (qualitative) incorporated a mix of the same four wording/keying conditions in an adapted measure that was administered individually to a small sample of college students. A think-aloud design was implemented to elicit verbalizations that were subsequently analyzed using a thematic networks approach.
Study 1 findings indicated that reliability was generally highest for scales where all items were positively worded/keyed and lowest for scales with balanced keying (or balanced keying and wording). Regarding dimensionality, method variance was more evident when keying was balanced than when keying was consistent. This tended to be the case whether wording was balanced or consistent. Study 2 revealed a number of factors that could contribute to differences in the response patterns elicited by negative and positive measurement items. These factors included the relative difficulty of processing negatively-worded statements, respondent characteristics such as reading skill and frustration tolerance, and idiosyncratic response styles. Among previously posited explanations for the differential functioning of negative and positive items, results from the studies supported some explanations (e.g., method variance; careless responding) more than others (e.g., the substantive explanation). Finally, it appeared that the psychometric consequences of balanced keying are no less substantial than those of balanced wording.
Methods for Identifying Differential Item and Test Functioning: An Investigation of Type I Error Rates and Power
by Amanda M. Dainis, Ph.D. (2008)
Advisors: Dr. J. Christine Harmes and Dr. Christine DeMars
This study examined bias, and therefore fairness, by investigating methods used for identifying differential item functioning (DIF). Four DIF-detection methods were applied to simulated data and empirical data. These techniques were selected to focus on a relatively new method, DFIT, and compare it to another IRT-based method (likelihood ratio test), and two Classical Test Theory-based methods (logistic regression and Mantel-Haenszel). Within the simulation study, four factors were manipulated: sample size, the presence and absence of impact, the uniformity and non-uniformity of the DIF, and the magnitude of the DIF. The Type I error and power rates of the methods were examined, and results indicated that the performance of the methods depended on the data conditions. The DFIT method had low Type I error rates across all simulated conditions. Regardless of the absence or presence of impact, the likelihood ratio test and the logistic regression main effect test had elevated Type I error rates under both sample size conditions. While the Mantel-Haenszel method's error rates were satisfactory across all conditions, its power was low when detecting non-uniform DIF. High power was demonstrated by the DFIT and likelihood ratio methods, but the logistic regression method yielded unsatisfactory power rates under the impact present condition. The DFIT method, as the central focus of this investigation, warrants further attention. A particular concern is the method's performance when applied to smaller sample sizes, due to fitting a 3PL model to a dataset with insufficient sample size. Another area for further investigation is the Item Parameter Replication (IPR) procedure, which is used to establish statistical significance within the DFIT framework. 
Although it has proven to be a reasonably efficient technique for establishing statistical significance, its conservative performance in the empirical portion of this study suggests the need for further examination under conditions with smaller amounts of DIF. DIF detection plays an integral part in constructing a fair and unbiased test. Based on empirical evidence, such as that reported here, researchers and practitioners should examine how an item or test is functioning statistically before spending resources to examine a conceptual, underlying cause of DIF.
by Susan Lynn Davis, Ph.D. (2005)
Advisor: Dr. Sara Finney
Assessing student development can be a challenge in that such constructs are difficult to define and difficult to measure. However, the need exists for universities to understand students' personal development as they progress through college. Although there are many important facets of student development worthy of examination, this study focused on one aspect of development commonly referenced in university mission statements: students' premonition for lifelong learning. Previous research has noted the difficulty in determining if universities are creating lifelong learners; however, this study attempted to examine this development by means of a related concept: student achievement goal orientation. One cohort of students was assessed on three occasions during college to estimate change in five dimensions of student achievement goal orientation: mastery-approach, performance-approach, mastery-avoidance, performance-avoidance, and work-avoidance. In addition to addressing the need for information on student development, this study attempted to address the shortcomings of prior longitudinal research, for example, by employing specific methodologies that allow inclusion of partial records, estimation of individual variation within change, examination of measurement invariance, and fluctuation within patterns of change. Before estimating change over time, it was first determined that the measurement of goal orientation was psychometrically stable across the three assessments, as indicated by the sufficient level of measurement invariance. Change was estimated using Latent Growth Modeling, which allowed the estimated pattern of change to be explicitly identified and described. Individual variation in change was also found and used to address ancillary research questions regarding change across dimensions of goal orientation and the relationship between initial goal orientation and change in goal orientation.
All five dimensions of goal orientation exhibited significant change across the three assessments. The identified patterns of change present interesting information for student development and student motivation. Discussion of this estimated change includes exploration of the change in terms of achievement goal orientation, students' motivational perspective, and the development of lifelong learners.
Construct Validity Evidence for University Mattering: Evaluating Factor Structure, Measurement Invariance, and Latent Mean Differences of Transfer and Native Students
by Megan France (2011)
Advisor: Dr. Robin D. Anderson
The psychological construct university mattering is defined as the feeling that one makes a difference and is significant to his or her university’s community. University mattering emerged from the theory of general mattering, which describes mattering as a complex construct consisting of the facets awareness, importance, ego-extension, and reliance. Researchers have attempted to operationalize university mattering through the development of various measures, specifically the Mattering Scale for Adults in Higher Education (MHE), the College Mattering Inventory (CMI), and the University Mattering Scale (UMS). The MHE and CMI were not developed based on an underlying theory of mattering and do not map to the facets listed above. The UMS was developed by writing items to represent these facets; however, after a psychometric evaluation of this scale, researchers provided numerous suggestions for improving the scale and the measurement of university mattering. Those suggestions were employed and the Revised University Mattering Scale (RUMS) was developed for use in the current study.
The purpose of this dissertation was twofold. First, the model-data fit of the RUMS was evaluated using confirmatory factor analysis (CFA). Five a priori models were tested using two independent samples: (a) a one-factor model, (b) a four-factor model, (c) a higher-order model, (d) a bifactor model, and (e) an incomplete bifactor model. In Sample 1, the incomplete bifactor model had the best overall fit. In Sample 2, the bifactor model had the best overall fit, which was surprising given that an admissible solution could not be found for this model in the first sample. However, across both samples, there were areas of localized misfit (i.e., large standardized covariance residuals). Furthermore, in using the incomplete bifactor and bifactor models to evaluate the items, numerous items were factorially complex (i.e., items cross-loaded on both the general mattering factor and their corresponding specific factor). Thus, ten items were removed. A modified model with 24 items was then fit to the data. Although fit improved, there were still areas of concern. Specifically, three items with large standardized residuals and six items that cross-loaded were deleted. The resulting 15-item measure fit a one-factor structure well and was named the Unified Measure of University Mattering-15 (UMUM-15).
The second purpose of this study was to assess the measurement invariance of the UMUM-15, the measure that emerged from Study 1. Of particular interest to this study was the comparison of transfer student scores on university mattering to scores of native students (i.e., students who began at the institution as first-years with no transfer credit). Transfer students constitute a large subgroup on many campuses, and they often report struggling with academic and social integration at their new campus. Therefore, it is possible that transfer students have a lower sense of mattering than native students. Qualitative research indicates that transfer students frequently report feeling a lack of mattering after relocating to their new college campuses. Furthermore, by definition, transfer students may lack feelings of mattering because they are in transition. Schlossberg (1989) theorized, “…people in transition often feel marginal and that they do not matter” (p. 6). For this study, the evaluation of measurement invariance was conducted using a step-by-step procedure introducing more equality constraints on the model across groups at each step (Meredith, 1993; Steenkamp & Baumgartner, 1998). Tests of measurement invariance began with testing configural invariance, followed by metric invariance, and finally, scalar invariance. With the establishment of scalar invariance, latent mean differences between transfer and native students were interpreted. As expected, transfer students had lower latent means than native students on university mattering. Not only did this study provide strong construct validity evidence for the UMUM-15, but it also made several notable contributions to the current research on university mattering.
by Keston H. Fulcher, Ph.D. (2004)
Advisor: Dr. T. Dary Erwin
Construct ambiguity and methodological shortcomings of instrument development have obscured the meaning of curiosity research. Nonetheless, it is an important construct, especially since it has been linked recently to lifelong learning. The purpose of these studies is to collect validity evidence for a new self-report questionnaire, The Curiosity Index (CI), which is based on Ainley's (1987) parsimonious breadth and depth conceptualization of curiosity. Proctors administered the CI to 1042 college freshmen, 854 college sophomores/juniors, and 74 members of a lifelong learning institute. In Study 1, freshmen CI data were analyzed using confirmatory factor analysis in an exploratory manner to identify items best representing the two-factor model. After selective item removal, all indices except for the RMSEA suggested good fit. In addition to the CI, college freshmen took several other instruments. In Study 2, scores derived from these instruments were correlated with the total, breadth, and depth scores. As predicted, the total CI, breadth, and depth scores correlated moderately to highly with trait curiosity and intrinsic motivation, weakly with confidence, not at all with intelligence or extrinsic motivation, and negatively with work-avoidance. In addition, mastery-approach correlated more highly with depth than with breadth, as predicted. In Study 3, average total, breadth, and depth scores of freshmen, sophomores, and lifelong learners were compared via ANOVAs. It was predicted that lifelong learners would have the highest scores on all categories, then sophomores, then freshmen. Lifelong Learning Institute members and sophomores did score significantly higher on total and depth curiosity than freshmen; however, no other differences were found. In Study 4, item response theory was used to investigate the amount of information obtained by the CI along the continuum of curiosity, from the least curious to the most curious students.
Generally, information was high; however, students scoring 1.5 SDs above the mean or higher were measured less reliably. Overall, the results support the use of the Curiosity Index for measuring breadth and depth curiosity. Future directions of validation include additional correlational studies with other curiosity measures, reversing the response scale, and creating more difficult breadth items.
by Makayla Grays (2013)
Advisor: Dr. Robin Anderson
Students must be sufficiently motivated in order to achieve the intended learning outcomes of their college courses. Research in education and psychology has found motivation to be context-dependent. Therefore, students’ motivation is likely to differ from one semester to the next according to which courses students are taking. However, there are also instances in which motivation levels may not change over time. In order to determine whether motivation for coursework changes across the academic career (and, if so, what variables may be related to that change), it is imperative to use a measure of motivation that is theoretically and psychometrically sound. In addition, the measure should function consistently over time—that is, the motivation measure must demonstrate longitudinal invariance. The purpose of this research was to investigate the factor structure and longitudinal invariance of a measure of motivation for coursework—the Expectancy, Value, and Cost Scale (EVaCS)—for incoming and mid-career college students. Study 1 examined the factor structure of the EVaCS and found support for a correlated three-factor model. The longitudinal invariance of this model was examined in Study 2, and results established the EVaCS to be an invariant measure of motivation for coursework across the two time points. An analysis of latent mean differences showed no significant overall mean changes in Expectancy and Value over time, but a statistically and practically significant increase was found for Cost (p < .05, d = 0.46). In addition to establishing the EVaCS as a structurally sound instrument, this research has implications for the measurement of motivation for coursework and the theoretical conceptualization of motivation.
Examining the Psychometric Properties of a Multimedia Innovative Item Format: Comparison of Innovative and Non-Innovative Versions of a Situational Judgment Test
by Sara Lambert Gutierrez, Ph.D. (2009)
Advisor: Dr. J. Christine Harmes
In the measurement field, innovative item formats have shown promise for increasing the capability to assess constructs not easily measured with traditional item formats. These items are often assumed to also provide opportunities for better measurement. However, little empirical research exists to support these assumptions. The purpose of this study was to explore the psychometric properties of a multimedia innovative item type and then compare the results to the properties of a non-innovative item format. Participants were administered one of two tests of identical content: one consisting of an innovative item format and the other consisting of a non-innovative item format. Exploratory factor analyses were conducted to evaluate the dimensionality of the two tests. The graded-response model was fit to both tests to produce item and test level characteristic curves, allowing for the examination of the reliability, or information, produced by each test and the individual items. Measurement efficiency, a ratio of the average amount of information provided relative to the average amount of time taken, was also reviewed. Face validity was examined by analyzing participant ratings on an eight-item post-test survey. Finally, criterion-related validity was investigated for the innovative item format by examining the relationship between test scores and supervisors’ ratings of employee performance. Findings from this research suggest that the use of innovative items may alter the underlying construct of an assessment, and could potentially provide more measurement information about examinees with low prioritization skills. Also, innovative item formats do not necessarily decrease measurement efficiency, as has been previously suggested. Participants’ perceptions of the tests indicated that they felt the innovative version provided a more realistic experience and increased levels of engagement. 
Criterion-related validity scores on the innovative version were inconsistent across two samples. The key implication of these results applies to any practitioner employing innovative items: the addition of innovative item formats likely alters the measurement properties of a test. Further examination is needed to understand whether or not the alteration results in better measurement. As the overall psychometric functioning of both versions of the assessment was low, replication is recommended prior to generalizing these results.
Integrating and Evaluating Mathematical Models of Assessing Structural Knowledge: Comparing Associative Network Methodologies
by Emily R. Hoole, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Structural knowledge assessment is a promising area of study for curriculum design and teaching, training, and assessment, but many issues in the field remain unresolved. This study integrates an associative network method, the Power Algorithm from the field of text comprehension, into the realm of structural knowledge assessment by comparing it to an already established associative network method, Pathfinder Analysis. Faculty members selected the fifteen most important concepts in Classical Test Theory. Students and faculty then completed similarity ratings for each concept pair using an online survey program, SurveyMonkey. A variety of similarity measures for the Power Algorithm networks and Pathfinder networks were used to predict course performance in a graduate level measurement class. For the Power Algorithm networks, the correlation between the student and expert links between the concepts in the associative network was computed, along with the congruence coefficient between the associative network links. Finally, a measure of network coherence, harmony, was calculated for each Power Algorithm network. For the Pathfinder networks, the NETSIM measure of similarity between the student and expert networks was computed. An unusual finding for the Pathfinder measure of similarity, NETSIM, was uncovered, in which NETSIM values negatively predicted course performance. Results indicate that the Power Algorithm similarity measures did not uncover a latent structure in the data, but that network harmony might serve as an indicator of quality for knowledge structures. Further investigation of the use of harmony in structural knowledge assessment is recommended.
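The congruence coefficient mentioned in this abstract can be sketched as follows, assuming it refers to Tucker's congruence coefficient computed over two vectors of link strengths. The function name and the student and expert vectors are hypothetical; the dissertation's own computation may differ.

```python
import math


def congruence_coefficient(student_links, expert_links):
    """Tucker's congruence coefficient between two link-strength vectors."""
    if not student_links or len(student_links) != len(expert_links):
        raise ValueError("vectors must be non-empty and of equal length")
    cross_products = sum(s * e for s, e in zip(student_links, expert_links))
    student_ss = sum(s * s for s in student_links)
    expert_ss = sum(e * e for e in expert_links)
    if student_ss == 0 or expert_ss == 0:
        raise ValueError("a vector of all zeros has no defined congruence")
    return cross_products / math.sqrt(student_ss * expert_ss)


# Hypothetical link strengths for three concept pairs.
proportional = congruence_coefficient([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Unlike the Pearson correlation, this coefficient does not center the vectors, so it is sensitive to the overall level of the link strengths as well as their pattern.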
by S. Jeanne Horst, Ph.D. (2010)
Advisor: Dr. Sara J. Finney
Despite high-stakes applications of assessment findings, assessment data are frequently collected in situations that are low-stakes for examinees. Because low-stakes tests are of little consequence to the examinees, test-taking motivation and thus the validity of inferences drawn from unmotivated examinees’ scores are of concern. The current study explored examinee self-reported effort in a several-hour-long low-stakes testing context via both structural equation mixture modeling and latent growth modeling approaches. An indirect approach to the structural equation mixture modeling results provided a heuristic for understanding examinee motivation in the low-stakes context. External criteria related to effort, such as goal orientations, self-efficacy for mathematics, and personality variables, contributed to explanations for three classes of examinees: higher-effort, mid-effort, and lower-effort. Expectancy-value theory, personality traits, and fatigue explanations of examinee motivation in a low-stakes context are considered.
Using Verbal Reports to Explore Rater Perceptual Processes in Scoring: An Application to Oral Communication Assessment
by Jilliam N. Joe, Ph.D. (2008)
Advisor: Dr. J. Christine Harmes
Performance assessment has shown increasing promise for meeting educators' needs for "authenticity" in assessment that many argue is missing from standardized multiple choice testing. However, for all of its merits, performance assessment continues to present a formidable challenge to measurement theory and practice when human raters are a component of scoring. Little is known about the cognitive processes raters employ in scoring, particularly in scoring oral communication assessments. The purpose of this study was to explore feature attention within an oral communication assessment scoring context, and how feature attention influenced decisions. An additional purpose was to investigate the utility of verbal reports as a method for collecting perceptual data within an aurally and visually intensive context. The present study employed a concurrent complementarity mixed methods design (Greene, Caracelli, & Graham, 1989), in which concurrent and retrospective verbal report methods were used to gather cognitive data from experienced and inexperienced raters. Specifically, verbal report data were examined to discover meaningful patterns in feature attention, as well as alignment between raters' internal frameworks and the test developer's scoring framework. Generalizability Theory was used to answer questions related to verbal report impact on scoring. Self-report data on perceived difficulty of the scoring task were also collected within each condition of verbal reporting. The findings from this research suggest that raters' internal frameworks as applied in the service of scoring did not align with the test developer's framework. Raters did not consistently attend to the features found in the scoring rubric, nor did they adhere to the analytic scoring system. Raters demonstrated complex integrative processes that often violated assumptions held about the rating process. 
Experienced raters, in particular, engaged in feature attention and subsequent decision-making that often "borrowed" information from other traits to better inform judgments, particularly when the rater endeavored to establish causal relationships for failures in trait mastery. These findings have several implications for rater selection and training procedures, as well as test development in oral communication.
Using the Right Tool for the Job: An Analysis of Item Selection Statistics for Criterion-Referenced Tests
by Andrew T. Jones, Ph.D. (2009)
Advisor: Dr. Christine DeMars
In test development, researchers often depend upon item analysis in order to select items to retain or add to an exam form. The conventional item analysis statistic is the point-biserial correlation. This statistic was developed to select items that would maximize the reliability indices of norm-referenced tests. When the focus of the exam is norm-referenced scores, then the point-biserial correlation works well as an item selection tool. However, the point-biserial correlation is also used in testing contexts where it may be less useful, specifically on criterion-referenced tests. Criterion-referenced tests have different reliability indices than norm-referenced tests, known as decision consistency indices. As such, using the point-biserial correlation to select items to maximize decision consistency may not have as much utility as other options. Researchers have developed several criterion-referenced item analysis statistics that have yet to be fully evaluated for their utility in selecting items for criterion-referenced tests. The purpose of this research was to evaluate each of the respective criterion-referenced item selection tools as well as the point-biserial correlation to determine which one optimized decision consistency.
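For readers unfamiliar with the conventional statistic discussed in this abstract, the point-biserial correlation between a dichotomous item and the total score can be sketched as below. This is a minimal, uncorrected version (the item is not removed from the total), and the function name and data are hypothetical.

```python
import math


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation of a 0/1 item with total test scores."""
    n = len(item_scores)
    if n < 2 or n != len(total_scores):
        raise ValueError("need two or more examinees and equal-length inputs")
    correct = [t for x, t in zip(item_scores, total_scores) if x == 1]
    incorrect = [t for x, t in zip(item_scores, total_scores) if x == 0]
    if not correct or not incorrect:
        raise ValueError("item must have both correct and incorrect responses")
    p = len(correct) / n  # proportion answering the item correctly
    mean_all = sum(total_scores) / n
    # Population standard deviation of the total scores.
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if sd == 0:
        raise ValueError("total scores have no variance")
    mean_correct = sum(correct) / len(correct)
    mean_incorrect = sum(incorrect) / len(incorrect)
    return (mean_correct - mean_incorrect) / sd * math.sqrt(p * (1 - p))


# Four hypothetical examinees: item responses and total scores.
r_pb = point_biserial([1, 1, 0, 0], [4, 3, 2, 1])
```

The statistic rewards items that separate high and low scorers across the whole score range, which is why it suits norm-referenced tests better than decisions made at a single cut score.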
by Cassandra R. Jones, Ph.D. (2009)
Advisor: Dr. Donna Sundre
Recently, more universities have started administering course evaluations online. With the process no longer in the classroom, some students decide not to complete their course evaluations on their own time, raising concerns that online course evaluation results are biased by nonresponse. This study examined course evaluation results at a small, diverse Mid-Atlantic Catholic university. A cross-classified random effects model was used to capture student responses across all of their courses. Nonresponse bias was examined by determining predictors of participation and predictors of online course evaluation ratings. Variables predicting both participation and ratings were considered to be a potential source of nonresponse bias. It was found that gender, ethnicity, and final course grade predicted online course evaluation participation. Only final course grade predicted online course evaluation ratings.
Assessing Model Fit of Multidimensional Item Response Theory and Diagnostic Classification Models using Limited-Information Statistics
by Daniel Jurich (2014)
Advisor: Dr. Christine DeMars
Educational assessments have been constructed predominately to measure broad unidimensional constructs, limiting the amount of formative information gained from the assessments. This has led various stakeholders to call for increased application of multidimensional assessments that can be used diagnostically to address students' strengths and weaknesses. Multidimensional item response theory (MIRT) and diagnostic classification models (DCMs) have received considerable attention as statistical models that can address this call. However, assessment of model fit has posed an issue for these models as common full-information statistics fail to approximate the appropriate distribution for typical test lengths. This dissertation explored a recently proposed limited-information framework for full-information algorithms that alleviates issues presented by full-information fit statistics. Separate studies were conducted to investigate the limited-information fit statistics under MIRT models and DCMs.
The first study investigated the performance of a bivariate limited-information test statistic, termed M2, with MIRT models. This study particularly focused on the root mean square error of approximation (RMSEA) index computed from M2 that quantifies the degree of model misspecification. Simulations were used to examine the RMSEA under a variety of model misspecifications and conditions in order to provide practitioners empirical guidelines for interpreting the index. Results showed the RMSEA provides a useful indicator to evaluate degree of model fit, with cut-offs around .04 appearing to be reasonable guidelines for determining a moderate misspecification. However, cut-offs necessary to reject misspecified models showed some dependence on the type of misspecification.
The second study extended the M2 and RMSEA indices to the log-linear cognitive diagnostic model, a generalized DCM. Results showed that the M2 followed the appropriate theoretical chi-squared distribution and RMSEA appropriately distinguished between various degrees of misspecification. Discussions highlight how the limited-information framework provides practitioners a pragmatic set of tools for evaluating the fit of multidimensional assessments and how the framework can be used to guide development of future assessments. Limitations and future research to address these issues are also presented.
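The RMSEA index computed from M2 in these two studies can be illustrated with a short sketch. It assumes the commonly used form sqrt(max(0, (M2 - df) / (N * df))); some sources divide by N - 1 instead of N, and the dissertation's exact formula is not stated in the abstract. The function name and the example values are hypothetical.

```python
import math


def rmsea_from_m2(m2, degrees_of_freedom, sample_size):
    """RMSEA from the M2 statistic, truncated at zero when M2 < df."""
    if degrees_of_freedom <= 0 or sample_size <= 0:
        raise ValueError("degrees of freedom and sample size must be positive")
    excess = (m2 - degrees_of_freedom) / (sample_size * degrees_of_freedom)
    return math.sqrt(max(0.0, excess))


# Hypothetical fit results: M2 = 150 on 100 df with 1,000 examinees.
example_rmsea = rmsea_from_m2(150.0, 100, 1000)
```

A value computed this way would then be compared against an interpretive guideline such as the cut-off of around .04 reported in the first study.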
An Empirical Demonstration of Direct and Indirect Mixture Modeling When Studying Personality Traits: A Methodological-Substantive Synergy
by Pamela K. Kaliski, Ph.D. (2009)
Advisor: Dr. Sara Finney
Many personality psychology researchers have employed the person-centered approach of cluster analysis to determine how many categorical Big Five personality types exist. The majority of these researchers have suggested that three Big Five personality types exist; however, results from two recent studies suggested that five types exist. In the first part of the current study, direct mixture modeling (an alternative person-centered approach to cluster analysis) was conducted on Big Five personality variables to explore the number of Big Five personality types that exist in college students, and two methodological approaches for gathering validity evidence for the personality types were demonstrated. Although more validity evidence must be gathered, results of the direct mixture modeling suggested that three personality types may exist in college students; however, the types differed in form from the three types that are commonly reported. In the second part of the current study, the same results were used to demonstrate an application of indirect mixture modeling. As opposed to interpreting the classes as substantively meaningful discrete subgroups, they were interpreted as common configurations that best represent the aggregate dataset. Additionally, the variable-centered approach of multiple regression was conducted. A comparison of the multiple regression results and the indirect mixture modeling results reveals the similarities and differences.
by James R. Koepfler (2012)
Advisor: Dr. Christine DeMars
Over the past decade, educational policy trends have shifted to a focus on examining students’ growth from kindergarten through twelfth grade (K-12). One way states can track students’ growth is through the use of a vertical scale. Presently, every state that uses a vertical scale bases the scale on a unidimensional IRT model. These models make a strong but implausible assumption that a single construct is measured, in the same way, across grades. Additionally, research has found that variations of psychometric methods within the same model can result in different vertical scales. The purpose of this study was to examine the impact of three IRT models (unidimensional model, U3PL; bifactor model with grade-specific subfactors, BG-M3PL; and bifactor model with content-specific factors, BC-M3PL), three calibration methods (separate, hybrid, and concurrent), and two scoring methods (EAP pattern scoring and EAP summed scoring, EAPSS) on the resulting vertical scales. Empirical data based on a state’s assessment program were used to create vertical scales for Mathematics and Reading from Grades 3-8. Several important results were found. First, the U3PL model always resulted in the worst model-data fit. The BC-M3PL fit the data best in Mathematics and the BG-M3PL fit the data best in Reading. Second, calibration methods led to minor differences in the resulting vertical scale. Third, examinee proficiency estimates based on the primary factor for each model were generally highly correlated (.97+) across all conditions. Fourth, meaningful classification differences were observed across models, calibration methods, and scoring methods. Overall, I concluded that none of the models were viable for developing operational vertical scales. Multidimensional models are promising for addressing the current limitations of unidimensional models for vertical scaling but more research is needed to identify the correct model specification within and across grades. 
Implications for these results are discussed within the context of research, operational practice, and educational policy.
Using Response Time and the Effort-Moderated Model to Investigate the Effects of Rapid Guessing on Estimation of Item and Person Parameters
by Xiaojing Kong, Ph.D. (2007)
Advisor: Dr. Steven Wise
Rapid-guessing behavior, an aberrant examinee behavior observed frequently in testing, creates a possible source of systematic measurement error undermining psychometric quality of items and tests, and the validity of test scores. The purposes of this dissertation were to examine how and to what extent rapid guessing can impact item parameter and proficiency estimates, and to explore and evaluate the effectiveness of specific psychometric models controlling for rapid guesses. Five interrelated studies were conducted, involving the use of item response times for detecting rapid-guessing behavior in the empirical study, and the employment of the observed distribution of response time effort in the simulation studies. The primary investigation involved comparing the performance of the standard IRT models (i.e., 3PL, 2PL, and 1PL) with that of the effort-moderated item response model (Wise & DeMars, 2006) and its variations (i.e., EM-3PL, EM-2PL, and EM-1PL), with respect to model fit, item parameter estimates, proficiency estimates, and test information and reliability. The performance discrepancies were first studied using data from a computer-based, low-stakes achievement test. The direction and magnitude of estimation bias under each model were further examined under simulated conditions in which the proportion of rapid guesses present in the data varied. Moreover, comparisons between the standard and EM models were conducted for conditions in which the probability of guessing an item right was correlated with examinees' level of proficiency. Additionally, the influence of rapid guessing on item parameter estimates was examined in the framework of classical test theory. Results indicate that a small proportion of rapid guesses can bias item indices and examinee proficiency estimates to a notable extent, and that the undesirable influence can be augmented by increased proportions of rapid guesses. 
The EM models produced more accurate estimates of item parameters, examinee proficiency, and test information than their counterpart IRT models in most simulated conditions. However, exceptions were observed with the two- and one-parameter models. Also, different patterns were found for conditions in which some level of cognitive process was assumed to be involved during a rapid guess.
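The core idea of the effort-moderated model can be sketched as follows, under the assumption that a response showing solution behavior follows the 3PL model while a rapid guess is correct with probability one over the number of response options. The function names, the omission of a scaling constant, and the parameter values are illustrative choices, not details taken from the dissertation.

```python
import math


def three_pl(theta, a, b, c):
    """3PL probability of a correct response (no scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def effort_moderated(theta, a, b, c, solution_behavior, n_options):
    """Probability of a correct response under an effort-moderated 3PL.

    solution_behavior is True when the response time suggests real
    effort and False when the response is flagged as a rapid guess.
    """
    if n_options < 2:
        raise ValueError("an item needs at least two response options")
    if solution_behavior:
        return three_pl(theta, a, b, c)
    return 1.0 / n_options  # rapid guess: chance-level success


# Hypothetical item: a = 1.0, b = 0.0, c = 0.2, four response options.
effortful = effort_moderated(0.0, 1.0, 0.0, 0.2, True, 4)
rapid = effort_moderated(0.0, 1.0, 0.0, 0.2, False, 4)
```

Because a rapid guess carries no dependence on proficiency in this sketch, such responses contribute nothing to the proficiency estimate, which is the sense in which the model controls for rapid guessing.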
The Treatment of Missing Data when Estimating Student Growth with Pre-Post Educational Accountability Data
by Jason Kopp (2014)
Advisor: Dr. Sara Finney
To ensure program quality and meet accountability mandates, it is becoming increasingly important for educational institutions to show "value-added" for attending students. Value-added is often evidenced by some form of pre-post assessment, where a change in scores on a construct of interest is considered indicative of student growth. Although missing data is a common problem for these pre-post designs, missingness is rarely addressed and cases with missing data are often listwise deleted. The current study examined the mechanism underlying, and bias resulting from, missingness due to posttest nonattendance in a higher-education accountability testing context. Although data were missing for some students due to posttest nonattendance, these initially missing data were subsequently collected via makeup testing sessions, thus allowing for the empirical examination of the mechanism underlying the missingness and the biasing effects of the missingness. Parameter estimates and standard errors were compared between the "complete" (i.e., including makeup) data and a number of different missing data techniques. These comparisons were completed across varying percentages of missingness and across noncognitive (i.e., developmental) and cognitive (i.e., knowledge-based) measures. For both noncognitive and cognitive measures, posttest data was found to be missing-not-at-random (MNAR), indicating that bias should occur when utilizing any missing data handling technique. As expected, the inclusion of auxiliary variables (i.e., variables related to missingness, the variable with missing values, or both) decreased the conditional relationship between the posttest noncognitive measure scores and posttest attendance (i.e., missingness); however, it increased the conditional relationship between posttest cognitive measure scores and posttest attendance. 
Thus, utilizing advanced missing data handling with auxiliary variables resulted in reduced parameter bias and reduced standard error inflation for the noncognitive measure, but increased parameter bias for some parameters (posttest mean and pre-post mean change) for the cognitive measure. These effects became more exaggerated as missingness percentages increased. With respect to future research, additional examination of bias-inducing effects when employing missing data techniques is needed. With respect to testing practice, assessment practitioners are advised to avoid missingness if possible through well-designed assessment methods, and to attempt to thoroughly understand the missingness mechanism when missingness is unavoidable.
by Abigail R. Lau, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Test-takers can be required to complete a test form, but cannot be forced to demonstrate their knowledge. Even if an authority mandates completion of a test, examinees can still opt to enter responses randomly. When a test has important consequences for individuals, examinees are unlikely to behave this way. However, random responding becomes more likely when the consequences associated with a test are less significant to the examinees. To thwart random responding, test administrators have explored methods to motivate examinees to respond attentively. Ultimately, differences in how examinees approach low-stakes tests are inevitable, and measurement models that account for this difference are needed. This dissertation provides an overview of the approaches that have been proposed for modeling low-stakes test data. Further, it specifically investigates the performance and utility of the mixed-strategies item response model (Mislevy & Verhelst, 1990) as one method of capturing amotivated examinees. Amotivated examinees are defined here as examinees who do not provide meaningful responses to any test items. A simulation study shows that if a normal item response model is used, parameter recovery rates are unacceptable when 9% or more of the examinees were amotivated. However, normal item response models may still be useful if less than 1% of examinees were amotivated. Use of the mixed-strategies item response model led to better parameter estimation than the normal item response model regardless of the proportion of amotivated examinees in the dataset. Additional research is needed to determine if using the mixed-strategies model results in satisfactory parameter recovery when greater than 20% of examinees were amotivated. A second study shows that when the mixed-strategies model was used on real low-stakes test data, the examinees classified as amotivated reported much lower test-taking effort than other examinees. 
However, examinees classified as amotivated were not very different from other examinees in terms of academic ability. This finding supports the notion that the second class in the mixed-strategies model is capturing amotivated examinees rather than low-ability examinees. Limitations of the mixed-strategies modeling technique are discussed, as is the appropriateness of applying this technique in various testing contexts.
Comparing the Relative Measurement Efficiency of Dichotomous and Polytomous Models in Linear and Adaptive Testing Conditions
by Susan Daffinrud Lottridge, Ph.D. (2006)
Advisor: Dr. Christine DeMars
The purpose of this study was to examine the relative performance of the dichotomous and nominal item response theory models in a linear testing and adaptive testing environment. A simulation study was conducted to investigate the relative measurement efficiency when moving from a dichotomous linear test to a dichotomous adaptive test, nominal linear test, and nominal adaptive test. Item exposure was also considered. Two dichotomous models (2PL, 3PL) and two nominal models (Bock's Nominal Model, Thissen's Nominal Model) were used. The simulated data were based upon responses to a 58-item mathematics test by 6711 students, and Ramsay's nonparametric item response theory method was used to generate option characteristic curves. These curves were then used to generate simulation data. MULTILOG was used to estimate item parameters. An item pool of 522 items was generated from the 58 items, with items being shifted left or right by increments of .05 to create new items. A 30-item fixed-length test was used, as was a 30-item adaptive test. 100 simulees were generated at each of 47 θ points on [-2.3, +2.3]. Using empirically derived standard errors, results indicated that the adaptive test and polytomous linear test outperformed the dichotomous linear test. The Thissen Nominal Model linear test performed similarly to the 3PL adaptive test, suggesting its potential use in place of the more expensive adaptive test. The Bock Nominal Model linear test also performed better than the 2PL linear test, but not as well as either of the adaptive tests. Future studies are suggested for better understanding the Thissen Nominal Model in light of its performance relative to the 3PL adaptive test.
Examining the Bricks and Mortar of Socioeconomic Status: An Empirical Comparison of Measurement Methods
by Ross E. Markle, Ph.D. (2010)
Advisor: Dr. Dena A. Pastor
The impact of socioeconomic status (SES) on educational outcomes has been widely demonstrated in the fields of sociology, psychology, and educational research. Across these fields, however, measurement models of SES vary, including single indicators (parental income, education, and occupation), multiple indicators, hierarchical models, and, most often, an SES composite provided by the National Center for Education Statistics. This study first reviewed the impact of SES on outcomes in higher education, followed by the various ways in which SES has been operationalized. In addition, research highlighting measurement issues in SES research was discussed. Next, several methods of measuring SES were used to predict first-year GPA at an institution of higher education. Findings and implications were reviewed with the hope of promoting more careful considerations of SES measurement.
The Effects of Item and Respondent Characteristics on Midpoint Response Option Endorsement: A Mixed-Methods Study
by Kimberly Rebecca Marsh, Ph.D. (2013)
Advisor: Dr. Dena A. Pastor
As the demand for accountability and transparency in higher education increases, so too has the call for direct assessment of student learning outcomes. Accompanying this increase of knowledge-based, cognitive assessments administered in a higher education context is an increased emphasis on assessing various noncognitive aspects of student growth and development over the course of their college career. Noncognitive outcomes are most often evaluated via self-report instruments associated with Likert-type response scales, posing unique challenges for researchers and assessment practitioners hoping to draw valid conclusions based upon this data. One long-debated characteristic of such assessments is the midpoint response option. More specifically, prior research suggests that respondents may be more or less likely to endorse the midpoint response option under different measurement and respondent dispositional conditions thus introducing construct-irrelevant variance within respondent scores. The current study expanded upon previous work to examine the effects of various item and respondent characteristics on endorsement and conceptualization of the midpoint response option in a noncognitive assessment context.
A mixed-methods approach was employed in order to fully address research questions associated with two studies, one quantitative and one qualitative in nature. Study 1 employed hierarchical generalized linear modeling to simultaneously examine the effects of respondent characteristics and experimentally manipulated item characteristics on the probability of midpoint response option endorsement. Respondent characteristics included self-reported effort expended on the assessments administered and respondent levels of verbal aptitude (SAT verbal scores). Respondents were randomly assigned different forms of the instrument which varied in item set location (scales administered earlier versus later in the instrument) and midpoint response option label (unlabeled, neutral, undecided, neither agree nor disagree). Experimental manipulation of these variables allowed for a stronger examination of these variables’ influence and how they might interact with respondent characteristics (i.e., effort, verbal aptitude) relative to previous studies investigating the issue. Study 2 employed a think-aloud protocol to further examine and understand respondent use and conceptualization of the midpoint response option upon manipulation of midpoint response option label (unlabeled, neutral, undecided, neither agree nor disagree). Four female and four male participants were randomly selected to participate in the think-aloud process using a subset of the items administered in Study 1.
Findings from both studies suggest that the midpoint response option is prone to abuse in practice. Results of Study 1 indicate that respondent characteristics, the experimental manipulation of item characteristics, and their interactions have the potential to significantly affect probability of midpoint response option endorsement. Results of Study 2 reveal that justifications provided by respondents for midpoint response endorsement are mostly construct-irrelevant and differences in conceptualization of the midpoint response option across variations in label appear to be idiosyncratic. These findings have significant implications for the validity of inferences made based upon noncognitive assessment scores and the improvement of assessment practice.
Unfolding Analyses of the Academic Motivation Scale: A Different Approach to Evaluating Scale Validity and Self-Determination Theory
by Betty Jo Miller, Ph.D. (2007)
Advisors: Dr. Donna Sundre and Dr. Christine DeMars
Using the framework of a strong program of construct validation (Benson, 1998), the current study investigated Self-Determination Theory (SDT; Deci & Ryan, 1985), the construct of academic motivation, and the Academic Motivation Scale (AMS; Vallerand et al., 1992). Building upon a body of prior research that provided only limited support for the theory and the seven-factor structure of the scale, a technique other than factor analysis was used to analyze responses to the AMS. Specifically, the utility of a unidimensional unfolding model in analyzing such responses was explored. In addition, scale development efforts were pursued, and multiple measures of academic motivation within a single sample of students were compared. Data were collected from three large samples of university students over the period of one year. The AMS and other instruments were self-report measures administered on computer and by paper-and-pencil. Qualitative data were collected from the second sample for the purposes of exploring new content for pilot items and for explaining certain results. Results have important implications for both SDT and the measurement of academic motivation using the AMS. A unidimensional unfolding model was shown to provide adequate fit to the data, supporting the argument that academic motivation is a single construct ordered along a continuum according to increasingly internal degrees of self-regulation. Using the estimated item locations, a shortened version of the AMS was proposed that was highly reliable and consistent with SDT. Finally, a comparison of unfolded motivation scores with summated AMS subscale scores revealed the folding of the response process.
by Christopher Orem (2012)
Advisor: Dr. Keston Fulcher
Meta-assessment, or the assessment of assessment, can provide meaningful information about the trustworthiness of an academic program's assessment results (Bresciani, Gardner, & Hickmott, 2009; Palomba & Banta, 1999; Suskie, 2009). Many institutions conduct meta-assessments for their academic programs (Fulcher, Swain, & Orem, 2012), but no research exists to validate the uses of these processes' results. This study developed the validity argument for the uses of a meta-assessment instrument at one mid-sized university in the mid-Atlantic. The meta-assessment instrument is a fourteen-element rubric that aligns with a general outcomes assessment model. Trained raters apply the rubric to annual assessment reports that are submitted by all academic programs at the institution. Based on these ratings, feedback is provided to programs about the effectiveness of their assessment processes. Prior research had used Generalizability theory to derive the dependability of the ratings provided by graduate students with advanced training in assessment and measurement techniques. This research focused on the dependability of the ratings provided to programs by faculty raters. In order to extend the generalizability of the meta-assessment ratings, a new fully-crossed G-study was conducted with eight faculty raters to compare the dependability of their ratings to those of the previous graduate student study. Results showed that the relative and absolute dependability of two-rater teams of faculty (ρ² = .90, Φ = .88) were comparable to the dependability estimates of two-rater teams of graduate students. Faculty raters were more imprecise than graduate students in their ratings of individual elements, but not substantially. Based on the results, the generalizability of the meta-assessment ratings was expanded to a larger universe of raters. Rater inconsistencies for elements highlighted potential weaknesses in rater trainings. 
Additional evidence should be gathered to support several assumptions of the validity argument. The current research provides a roadmap for stakeholders to conduct meta-assessments and outlines the importance of validating meta-assessment uses at the program, institutional, and national levels.
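As a rough illustration of the relative and absolute dependability coefficients mentioned in this abstract, the sketch below computes both for a simplified single-facet, fully crossed design (objects of measurement by raters). The function name and the variance components are hypothetical and are not taken from the study, whose design may have included additional facets such as rubric elements.

```python
def dependability(var_p, var_r, var_pr_e, n_raters):
    """Return (relative, absolute) coefficients for a crossed p x r design.

    var_p    : variance due to the objects of measurement (e.g., programs)
    var_r    : variance due to raters (rater severity/leniency)
    var_pr_e : interaction-plus-residual variance
    n_raters : number of raters averaged over in the decision study
    """
    # Relative error ignores the rater main effect; absolute error includes it.
    rel_error = var_pr_e / n_raters
    abs_error = (var_r + var_pr_e) / n_raters
    e_rho2 = var_p / (var_p + rel_error)  # generalizability coefficient
    phi = var_p / (var_p + abs_error)     # index of dependability
    return e_rho2, phi


# Hypothetical variance components, chosen only for illustration.
e_rho2, phi = dependability(var_p=0.50, var_r=0.02, var_pr_e=0.10, n_raters=2)
print(round(e_rho2, 3), round(phi, 3))
```

Because the absolute error term adds the rater main effect, the absolute coefficient can never exceed the relative one, which matches the ordering of the two values reported in the abstract.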
by Suzanne L. Pieper, Psy.D. (2003)
Advisors: Dr. Donna L. Sundre and Dr. Sara J. Finney
This study refining and extending the 2 x 2 achievement goal framework of mastery-approach, mastery-avoidance, performance-approach, and performance-avoidance goals had three purposes: (1) to investigate the possibility of a fifth goal orientation: work avoidance, (2) to examine the functioning of new items written to better measure the four goal orientations, and (3) to gather validity evidence for the four goal orientations and possibly a fifth goal orientation by examining the association between the variables need for achievement and fear of failure and the goal orientations. The results of this study provided support for the four-factor model of achievement goal orientation using the 12-item Achievement Goal Questionnaire (AGQ) (Elliot & McGregor, 2001) modified for a general academic domain. The four-factor model provided a good fit to the data and a better fit than competing models. Second, the results of this study provided support for the improved reliability and validity of the 16-item AGQ with one item added to each goal orientation subscale to improve measurement. Third, the results of this study provided strong evidence for the existence of a fifth goal orientation: work-avoidance. The five-factor model of goal orientation--mastery-approach, mastery-avoidance, performance-approach, performance-avoidance, and work-avoidance--as measured by the 20-item AGQ provided a good fit to the data. Furthermore, the work-avoidance orientation demonstrated relationships with the criterion variables workmastery, competitiveness, and fear of failure that were expected based on previous theory and research. While this study answers the call of Maehr (2001) to reinvigorate goal theory by considering many possible ways students engage in learning, much still needs to be done in terms of defining and assessing the work-avoidance goal orientation. Additionally, the limitations of this study need to be addressed.
The results of this study need to be validated with other student populations and in a variety of educational contexts. Finally, because the same sample of college students was used for all three analytical stages of this study, thereby increasing the possibility for Type 1 error, future studies need to validate these results with fresh samples.
by Shelley Ragland, Ph.D. (2010)
Advisor: Dr. Christine E. DeMars
In order to fairly compare scores derived from different forms of the same test within the Item Response Theory framework, all individual item parameters must be on the same scale. A new approach, the RPA method, which is based on transformations of predicted score distributions, was evaluated here and was shown to produce results comparable to the widely used Stocking-Lord (SL) method under varying conditions of test length, number of common items, and differing ability distributions in a simulation study. The new method was also examined using actual student data and a resampling analysis. Both the simulation study and actual student data study resulted in very similar transformation constants for the RPA and SL methods when 15 or 10 common items were used. However, the RPA method produced greater variance than the SL method, especially when only 5 common items were used in the actual student data analysis. The simulated and actual data research findings demonstrate that the RPA method is a viable option for producing the transformation constants necessary for transforming separately calibrated item parameter estimates prior to equating.
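To make the idea of transformation constants concrete, the sketch below uses the simple mean-sigma approach, which is neither the RPA nor the Stocking-Lord method discussed in this abstract. It only shows how a slope A and an intercept B, once obtained from common-item difficulties, place separately calibrated item parameters on a base scale. The function names and the parameter values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Estimate linking constants from common-item difficulty estimates.

    b_new  : difficulties of the common items on the new form's scale
    b_base : difficulties of the same items on the base scale
    """
    A = pstdev(b_base) / pstdev(b_new)       # slope
    B = mean(b_base) - A * mean(b_new)       # intercept
    return A, B


def rescale_item(a, b, A, B):
    """Place one item's discrimination and difficulty on the base scale."""
    return a / A, A * b + B


# Hypothetical common-item difficulties, for illustration only.
b_new = [-1.0, 0.0, 1.0]
b_base = [-1.5, 0.5, 2.5]
A, B = mean_sigma_constants(b_new, b_base)
a_star, b_star = rescale_item(1.2, 0.0, A, B)
print(A, B, a_star, b_star)
```

Characteristic-curve methods such as Stocking-Lord estimate A and B by minimizing differences between test characteristic curves rather than by matching moments, but the resulting constants are applied to the item parameters in the same way.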
Comparability of Paper-and-Pencil and Computer-Based Cognitive and Non-Cognitive Measures in a Low-Stakes Testing Environment
by Barbara E. Rowan, Ph.D. (2010)
Advisors: Dr. Joshua T. Goodman and Dr. J. Christine Harmes
Computerized versions of paper-and-pencil tests (PPT) have emerged over the past few decades, and some practitioners are using both formats concurrently. But computerizing a PPT may not yield equivalent scores across the two administration modes. Comparability studies are required to determine if the scores are equivalent before treating them as such. These studies ensure fairer testing and more valid interpretations, regardless of the administration mode used. The purpose of this study was to examine whether scores from paper-based and computer-based versions of a cognitive and a non-cognitive measure were equivalent and could be used interchangeably. Previous research on test score comparability used simple methodology that provided insufficient evidence for the score equivalence. This study, however, demonstrated a set of methodological best practices, providing a more complex and accurate analysis of the degree of measurement invariance that exists across groups. The computer-based test (CBT) and PPT contained identical content and varied only in administration mode. Participants took the tests in only one format, and the administration was under low-stakes conditions. Confirmatory factor analyses were conducted to confirm the established factor structure for both the cognitive and the non-cognitive measures, and reliability and mean differences were checked for each subscale. The scalar, metric, and configural invariance were tested across groups for both measures. Because of the potential impact on measurement invariance, differential item functioning (DIF) was tested and those items were removed from the data set; measurement invariance across test modes was again evaluated.
Results indicate that both the cognitive and the non-cognitive measures were metric invariant (essentially tau-equivalent) across groups, and the DIF items did not impact the degree of measurement invariance found for the cognitive measure. Therefore, the same construct was measured to the same degree, but scores are not equivalent without rescaling. Measurement invariance is a localized issue; thus, comparability must be evaluated for each instrument. Practitioners cannot assume that the scores obtained from the PPT and CBT will be equivalent. How these test scores are used will determine what changes must be made with tests that have less than strict measurement invariance.
Development and Validation of the Preservice Mathematical Knowledge for Teaching Items (PMKT): A Mixed-Methods Approach
by Javarro Russell (2011)
Advisor: Dr. Robin D. Anderson
Mathematical knowledge for teaching (MKT) is the knowledge required for teaching mathematics to learners. Researchers suggest that this construct consists of multiple knowledge domains. Those domains include teachers’ knowledge of mathematical content and knowledge about teaching mathematics. These domains of MKT have been theoretically and empirically examined to determine their effects on K-12 student achievement. However, empirical evidence of this relationship is limited due to a lack of measures to assess MKT.
Recently, researchers have constructed measures of MKT to evaluate the effectiveness of professional development activities with in-service teachers. These measures, however, lack validity evidence for use in teacher education program assessment. Program assessment allows programs to determine the effectiveness of their curriculum on assisting preservice teachers in meeting learning outcomes. This process requires adequate tools for assessing the extent to which students meet the learning outcomes. In a teacher education program, some of those learning outcomes are related to MKT. To assess these outcomes, teacher educators need measures of MKT that relate to their learning objectives. Previous research has not supported the use of any current measure for assessing MKT in a teacher education program.
To address this gap in the literature, a process of construct validation was conducted for items developed to assess MKT at the program level of a teacher education program. Validation evidence for the items was obtained by using Benson’s framework of a strong program of construct validation. The factor structure of the items was analyzed and expected group differences were assessed. Qualitative data from cognitive interviews were then used to provide convergent evidence in regards to the construct validity of the items. The overall purpose of these methods of inquiry was to develop items that would reflect the MKT that resulted from a teacher education mathematics curriculum.
Results from factor analyses indicated that the 23 PMKT items could plausibly be composed into an 11-item essentially unidimensional scale of specialized content knowledge. The factor underlying responses to the 11-item scale appeared to be related to a specified learning objective of the program. This learning objective suggests that graduating preservice teachers should be able to evaluate a K-8 student’s mathematical work or arguments to determine if the ideas presented are valid. Interviews with participants revealed themes indicating that the items were measuring this aspect of specialized content knowledge. Comparisons among students at differing levels of the mathematics education curriculum revealed significant, but small differences between upper level preservice teachers and preservice teachers who received no instruction. Further analysis of these items indicated that they could be improved by focusing future item development on examining misconceptions in evaluating mathematical arguments. These findings have several implications for teacher education program assessment, as well as item development for measuring MKT.
by Kelly A. Williams Scocos, Psy.D. (2002)
Advisor: Dr. Steven L. Wise
The goal of this dissertation was to investigate the viability of the most broadly accepted definition of critical thinking. This definition is the Delphi model (Facione, 1990) and it receives support from professionals both in education and in business. A single, multipart instrument, the Williams Critical Thinking Assessment, was developed to measure the individual facets of critical thinking delineated by the Delphi conceptualization. Results indicated that the Delphi model constituted a workable critical thinking definition. Furthermore, critical thinking defined in a manner consistent with the Delphi model was demonstrated to be distinct from scholastic achievement. Educationally, these discoveries have implications for both critical thinking instruction and learning in a collegiate environment.
by J. Carl Setzer, Ph.D. (2008)
Advisor: Dr. Dena Pastor
Recently, there have been two types of model formulations used to demonstrate the utility of explanatory item response models. Specifically, the generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM) have expanded item response models to include covariates for item effects, person effects, or both simultaneously. Both frameworks have recently been garnering greater attention in the educational measurement field. Despite these two frameworks being conceptually equivalent, much of the related literature has emphasized one or the other. However, to date, there has been little attempt to associate the frameworks together. In addition, item response models that have been described within the GLMM and HGLM frameworks have mostly been of the unidimensional type. Very little has been done to demonstrate the utility of an explanatory multidimensional item response model. As explanatory models become more prevalent in research and practice, it is important to maintain software that can estimate them. SAS is an all-purpose and widely-used program that can estimate explanatory item response models. However, no previous research has examined how well SAS can recover the parameters of an explanatory multidimensional Rasch model (EMRM). There were three main goals of this study. First, several types of Rasch models, including both non-explanatory and explanatory models, were summarized within the GLMM and HGLM frameworks. The equivalence of these two frameworks was demonstrated for each model. Second, a parameter recovery study was performed to determine how well SAS PROC NLMIXED can recover the parameters of an EMRM. The effect of sample size and test length on parameter recovery was assessed. The results of the simulation study indicate that very little bias occurs, even with small sample sizes and short test lengths. The final goal was to demonstrate the utility of an EMRM model using empirical data. 
Using data collected from the Marlowe-Crowne Social Desirability Scale (MCSDS), an EMRM was fit to the data while using gender as a covariate. Interpretations of the model parameter estimates were given and it was concluded that gender did not explain a significant amount of variation in either of the MCSDS subscales.
Cyberspace Versus Face-to-Face: The Influence of Learning Strategies, Self-Regulation, and Achievement Goal Orientation
by Kara Owens Siegert, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Web-based education (WBE) is a popular educational format that allows certain learning and teaching advantages. However, some students may not learn or perform as well in this environment as compared to traditional face-to-face education (F2FE) settings. Little research has examined the differential impact of learner characteristics on performance in these two environments. This study explored differences in learning strategies, self-regulation skills, and achievement goal orientation, in WBE and F2FE college classrooms and found that students in the two environments could be differentiated based on the composite of learner characteristics. Specifically, WBE and F2FE students differed in terms of self-regulation, elaboration, and mastery-avoidance goals. Learner characteristics, however, did not have a differential influence on college student performance in the two environments.
by Alan Socha
Advisor: Dr. Christine DeMars
Educational tests often use blueprints to define subsections, or subscales. Many tests, despite being based on blueprints, are not designed to support the precise estimation of these subscores. For these tests, the subscores would be too unreliable to be an accurate measure of an individual's knowledge or skills in a subscale. Recently, however, the educational marketplace has been pressuring testing and assessment programs to report subscores. The simple reporting of subscores opens the door to many misuses, such as decisions that may be harmful to stakeholders (e.g., flawed diagnostic, curricular, or policy decisions). The key question, as asked by Wainer et al. (2001), is "how can we obtain highly reliable scores on whatever small region of the domain spanned by the test that might be required for a particular examinee, without taking an inordinate amount of time for all other examinees?" (p. 344).
It is expensive to expand portions of the test in order to achieve more accurate estimates of subscores; therefore, subscores are sometimes stabilized through augmentation. There are a wide variety of augmented score methods and studies investigating the performance of each. Almost all of these studies assume that each item measures one, and only one, subscale. This study investigated how several of the augmented methods perform when some of the items measure multiple subscales (i.e., when some items are complex items). This study also investigated how several augmented methods perform when these complex items were treated as measuring a single subscale.
The results of this study suggest that each method is robust to model misspecification. That is, borrowing information from the subscale that items cross-load on is making up for misspecifying the item as having simple structure. Unidimensional item response theory performed better than modifications of both the Wainer et al. (2001) and Haberman (2008) approaches in this study, and is therefore the best alternative for estimating subscores in tests similar to those generated in the current study when multidimensional item response theory is not feasible.
Should We Worry About the Way We Measure Worry Over Time? A Longitudinal Analysis of Student Worry During the First Two Years of College
by Peter J. Swerdzewski, Ph.D. (2008)
Advisor: Dr. Sara Finney
This study evaluated longitudinal change in student worry using the Student Worry Questionnaire-30 (SWQ-30), an instrument that represents worry as six separate factors: (1) Worrisome Thinking, (2) Financial-Related Concerns, (3) Significant Others' Well-Being, (4) Academic Concerns, (5) Social Adequacy Concerns, and (6) Generalized Anxiety Symptoms. Prior to evaluating longitudinal change, the factor structure of the SWQ-30 was examined using four cross-sectional independent samples. A best-fitting six-factor model was found that removed four redundant items from the original 30-item instrument. This six-factor 26-item model was then fit to data from a longitudinal sample of students who completed the measure as entering freshmen and second-semester sophomores. Evidence for full configural and metric invariance was found. When the data were tested for scalar invariance, one item from each of the following subscales was found to be scalar non-invariant: Worrisome Thinking, Social Adequacy Concern, and Financial-Related Concern. Additionally, most of the items from the Generalized Anxiety Symptoms factor were found to be scalar non-invariant, thus making the latent mean difference for the factor uninterpretable. Overall, interpretable latent mean differences and stability estimates provided evidence that student worry was stable over time, although students appeared to decrease in the degree to which they worried about social adequacy. These findings suggest that some aspects of worry and the infamous sophomore slump may be unrelated phenomena. In sum, the SWQ-30 is a promising measure of multidimensional student worry; however, it has not received adequate empirical study. Furthermore, given the dearth of empirical research examining the stability of student worry over time and the unique characteristics of the samples under study, future research must be conducted to better uncover the link between worry and sophomore slump.
An Application of Generalizability Theory to Evaluate the Technical Quality of An Alternate Assessment
by Melinda A. Taylor, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Federal regulations require testing of students with the most severe cognitive disabilities, although little guidance has been given regarding the format of such assessments or how technical quality should be documented. It is well documented that specific challenges exist with the documentation of technical quality for alternate assessments that are often less standardized than their general assessment complements. One of the first steps in documenting technical quality is to determine the reliability of scores resulting from an assessment. Typical measures of reliability under a classical test theory framework, such as coefficient alpha, do little in modeling the multiple sources of error that are characteristic of alternate assessments. Instead, Generalizability theory (G-theory) allows researchers to identify potential sources of variability in scores and to analyze the relative contribution of each of those modeled sources. The purpose of this study was to demonstrate an application of G-theory to examining the technical quality of scores from an alternate assessment. A G-study where rater type, assessment attempts, and tasks were identified as facets was examined to determine the relative contribution of each facet to observed score variance. Data resulting from the G-study were used to examine the reliability of scores using a criterion-referenced interpretation of error variance associated with scores. The current assessment design was then modified to examine how changes in the design might impact the reliability of scores. Based on established criteria, the proposed designs were evaluated in terms of their ability to yield acceptable reliability coefficients. As a final step in the analysis, designs that were deemed satisfactory were evaluated from a practical standpoint with respect to the feasibility of adapting them into a statewide standardized assessment program used for student and school accountability purposes.
by Amy DiMarco Thelk, Ph.D. (2006)
Advisor: Dr. Donna L. Sundre
Published literature reveals little information about whether examinees should be told of established performance expectations prior to test taking. This study investigated whether students who are told of a test's cut scores, information about student performance from previous test administrations, or both types of information have significantly different test performance or motivation scores than those receiving only the standardized instructions. This research was conducted at a community college during regular assessment testing. Students taking a quantitative and scientific reasoning exam (QRSR) were assigned to one of four testing conditions. Motivation information was collected via two measures: Response Time Effort (RTE; Wise & Kong, 2005) and the Student Opinion Scale (SOS; Sundre, 1999). A confirmatory factor analysis was conducted to determine whether the two-factor structure of the SOS held up when administered to a community-college sample. The results support the established structure when administered in this setting. The second phase of analysis involved testing three path models to assess the impact of (a) SOS; (b) RTE; and (c) SOS and RTE on test scores. While the treatments had only small, and contradictory, effects on SOS and RTE, all three models were significant. SOS accounted for 9% of test score variance, RTE alone accounted for 16% of the variance in test scores, and the combination of RTE and SOS accounted for 19% of the variance in test scores. The final phase of the project involved interviewing a sample of students (n=8) following testing. Interviewees were asked about treatment recognition, effort, and ideas about motivating students in testing situations. While students were able to recognize the written information they had seen prior to testing, only one freely recalled seeing additional data prior to testing. These findings call the potency of the manipulations into question.
Also, while students verbally reported variations in how hard they tried, scores on the Effort subscale were not significantly different. The results of this study do not offer strong guidance on whether to tell students about cut scores prior to testing. Limitations of the research and suggestions for future research are offered.
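For readers unfamiliar with Response Time Effort, the sketch below shows the general idea behind the index: each item response is flagged as solution behavior when its response time meets or exceeds an item-specific threshold, and RTE is the proportion of flagged items. The function name, response times, and thresholds are hypothetical, and the way thresholds were set in this study is not described in the abstract.

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items answered with solution behavior.

    response_times : seconds the examinee spent on each item
    thresholds     : per-item cutoffs separating rapid guessing from
                     solution behavior (assumed values in this sketch)
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    # 1 = solution behavior (time at or above threshold), 0 = rapid guess
    flags = [1 if rt >= th else 0 for rt, th in zip(response_times, thresholds)]
    return sum(flags) / len(flags)


# Hypothetical examinee: two of the four responses are faster than the cutoff.
times = [12.0, 2.0, 30.5, 1.5]
cutoffs = [5.0, 5.0, 5.0, 5.0]
rte = response_time_effort(times, cutoffs)
print(rte)
```

Values near 1 suggest the examinee engaged with most items, while lower values suggest more rapid guessing.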
by John Taylor Willse, Psy.D. (2002)
Advisor: Dr. Christine DeMars
Computer adaptive tests (CAT) have a tendency to capitalize on chance errors in a-parameter estimates (van der Linden and Glas, 2000). A-stratified, match difficulty, separate item-selection/item-scoring (half), and 1-pl only CATs were compared to a maximum information CAT for their ability to address the negative effects associated with controlling capitalization on chance. The CATs were evaluated in 3 simulations (i.e., using 1-, 2-, and 3-pl true item response theory models). Results were presented in terms of prevention of capitalization on chance and overall effectiveness. The phenomenon of capitalization on chance by a maximum information CAT was replicated. The a-stratified, match difficulty, and half CATs were successful at preventing capitalization on chance. Through consideration of overall effectiveness and ease of implementation, the match difficulty CAT was determined to be the best alternative to the maximum information CAT. The 1-pl only CAT was shown to be a poor alternative, especially in the 3-pl true item simulation.
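The contrast between maximum information and match difficulty item selection can be sketched as follows under a 2-pl model without the 1.7 scaling constant. The item pool, function names, and ability value are hypothetical; the study's actual simulation design is not reproduced here.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2-pl model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info_2pl(theta, a, b):
    """Fisher information of a 2-pl item at ability theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_max_info(theta, items):
    """Index of the item with the largest information at theta."""
    return max(range(len(items)), key=lambda i: info_2pl(theta, *items[i]))


def pick_match_difficulty(theta, items):
    """Index of the item whose difficulty is closest to theta."""
    return min(range(len(items)), key=lambda i: abs(items[i][1] - theta))


# Hypothetical pool of (a, b) estimates; the second item has a large a estimate.
items = [(0.8, 0.0), (2.0, 0.6), (1.0, -1.0)]
theta = 0.0
chosen_info = pick_max_info(theta, items)
chosen_match = pick_match_difficulty(theta, items)
print(chosen_info, chosen_match)
```

Because information grows with the square of the a-parameter, maximum information selection is drawn toward items with large a estimates, including those that are large partly through estimation error. Match difficulty selection ignores the a estimates altogether, which is one way to see why it avoids capitalizing on chance.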
Students’ Attitudes toward Institutional Accountability Testing in Higher Education: Implications for the Validity of Test Scores
by Anna Zilberberg (2013)
Advisor: Dr. Sara Finney
Recent calls for an increase in educational accountability in K-16 resulted in an uptick of low-stakes testing for accountability purposes and, as a result, an increased need for ensuring that students’ test scores are reliable and valid representations of their true ability. Focusing on accountability testing in higher education, the current program of research examined the role of students’ attitudes toward such tests. To this end, this program of research was comprised of two stages: (1) collecting validity evidence for a recently developed measure of students’ attitudes toward institutional accountability testing; (2) conducting a number of studies addressing substantive research questions related to these attitudes.
The analyses associated with the first stage yielded a revised psychometrically sound self-report measure of students’ attitudes toward accountability testing in higher education (SAIAT-HE-revised) that consists of three interrelated, yet conceptually distinct, subscales. Moreover, invariance of the SAIAT-HE-revised was upheld across first-year and mid-career students, indicating that the measure can be used with these two student populations to examine relationships among attitudes and with other variables, as well as to explore differences in the levels of attitudes across first-year and mid-career students. In addition, known-groups validity evidence was garnered for the SAIAT-HE-revised given the finding that mid-career students, as predicted, held more skeptical attitudes toward accountability testing than first-year students.
The analyses associated with the second stage revealed several findings pertinent to the role of attitudes on test-taking motivation and accountability test scores. More specifically, a series of structural models examined the effect of attitudes on test performance via the mediating variables of test-taking motivation (i.e., perceived test importance and test-taking effort). First, it was revealed that first-year college students’ attitudes toward state-mandated K-12 accountability testing were positively related to, but conceptually and empirically distinct from, their attitudes toward accountability testing in college. Second, the relationship between attitudes toward K-12 testing and performance on a college accountability test was fully mediated by attitudes toward college accountability testing, thereby relieving higher education administrators from needing to change students’ attitudes toward K-12 testing in an effort to improve performance on college accountability tests. Third, as predicted, the relationship between perceived importance of the tests and test performance was fully mediated by test-taking effort. Fourth, as predicted, the extent to which first-year and mid-career students were disillusioned by college accountability testing indirectly affected their test performance via perceived importance of the tests and test-taking effort. Fifth, students’ perceived understanding of the tests’ purpose directly affected test-taking effort and indirectly affected test performance via test-taking effort and perceived importance. Interestingly, the extent to which students perceived college accountability tests to be fair and valid did not influence their test-taking motivation or performance.
In addition, the relationship between attitudes toward higher education accountability tests and attendance at testing sessions was examined. Non-compliant students, who did not attend the testing session, were found to have lower levels of perceived understanding of the tests’ purpose than compliant students. The non-compliant and compliant students did not differ with respect to perceived validity of the tests or disillusionment with accountability testing.
In tandem, these findings indicate that an intervention aimed at augmenting students’ test-taking motivation and compliance with testing should occur sometime before the mid-point of students’ academic careers and should focus on clarifying the purpose of higher education accountability testing. More positive attitudes toward college accountability testing are likely to increase attendance and test-taking motivation, thereby leading to more valid test scores, and thus more accurate evaluation of academic programming.
Examining the Performance of the Metropolis-Hastings Robbins-Monro Algorithm in the Estimation of Multilevel Multidimensional IRT Models
by Bozhidar M. Bashkov, Ph.D. (2015)
Advisor: Dr. Christine E. DeMars
The purpose of this study was to review the challenges that exist in the estimation of complex (multidimensional) models applied to complex (multilevel) data and to examine the performance of the recently developed Metropolis-Hastings Robbins-Monro (MH-RM) algorithm (Cai, 2010a, 2010b), designed to overcome these challenges and implemented in both commercial and open-source software programs. Unlike other methods, which either rely on high-dimensional numerical integration or approximation of the entire multidimensional response surface, MH-RM makes use of Fisher's Identity to employ stochastic imputation (i.e., data augmentation) via the Metropolis-Hastings sampler and then applies the stochastic approximation method of Robbins and Monro to approximate the observed data likelihood, which decreases estimation time tremendously. Thus, the algorithm shows great promise in the estimation of complex models applied to complex data.
To put this promise to the test, the accuracy and efficiency of MH-RM in recovering item parameters, latent variances and covariances, as well as ability estimates within and between groups (e.g., schools) was examined in a simulation study, varying the number of dimensions, the intraclass correlation coefficient, the numbers of clusters, and cluster size, for a total of 24 conditions. Overall, MH-RM performed well in recovering the item, person, and group-level parameters of the model. More replications are needed to better determine the accuracy of the analytical standard errors for some of the parameters. Limitations of the study, implications for educational measurement practice, and directions for future research are offered.
by Jerusha J. Gerstner, Ph.D. (2015)
Advisor: Dr. Deborah L. Bandalos
Researchers have studied item serial-order effects on attitudinal instruments by considering how item-total correlations differ based on the item's placement within a scale (e.g., Hamilton & Shuminsky, 1990). In addition, other researchers have focused on item negative-keying effects on attitudinal instruments (e.g., Marsh, 1996). Researchers consistently have found that negatively-keyed items relate to one another above and beyond their relationship to the construct intended to be measured. However, only one study (i.e., Bandalos & Coleman, 2012) investigated the combined effects of serial-order and negative-keying on attitudinal instruments. Their brief study found some improvements in fit when attitudinal items were presented in a unique, random order to each participant, which is easily implemented using computer survey software.
In this study I replicated and extended these findings by considering three attitudinal scales: Conformity Scale (Goldberg et al., 2006; Jackson, 1994) and two subscales of the Big Five – Conscientiousness and Agreeableness (John & Srivastava, 1999). In addition, I collected and analyzed qualitative data in the form of think-alouds and used these data to inform the quantitative results in an explanatory sequential mixed-methods design (Creswell, 2011). I administered three different groupings of the items on these three instruments to random groups of university students. The items were displayed in either a blocked (i.e., all positively-keyed items followed by all negatively-keyed items), alternating (i.e., items alternated keying every other item beginning with a positively-keyed item), or random (i.e., items presented in a different random order for each participant) order.
When each participant saw a different randomly-ordered version of the attitudinal scale, I found fewer expected measurement error correlations among items of the same keying and in close proximity (i.e., serial order) to one another. Moreover, in this random ordering, the modification indices associated with the suggested measurement error correlations were lower than in the other orderings. Finally, the fit of the model to the data was the best in the random ordering for all except the Agreeableness scale. Practitioners are urged to administer attitudinal scales in a computer-generated random order unique to each participant whenever possible.
by Megan Rodgers Good, Ph.D. (2015)
Advisor: Dr. Keston H. Fulcher
To improve quality, higher education must be able to demonstrate learning improvement. To do so, academic degree programs must assess learning, intervene, and then re-assess to determine if the intervention was indeed an improvement (Fulcher, Good, Coleman, and Smith, 2014). This seemingly "simple model" is rarely enacted in higher education (Blaich & Wise, 2011). The purpose of this embedded mixed methods study was to investigate the effectiveness and experience of a faculty development program focused on a specific programmatic learning outcome. Specifically, the intervention was intended to increase students' ethical reasoning skills aligned with a university-wide program. The results suggested that this experience did indeed improve students' ethical reasoning skills. Likewise, the experience was positive for faculty participants. This study provides evidence supporting the connection of assessment and faculty development to improve student learning.
by Matthew S. Swain, Ph.D. (2015)
Advisor: Dr. Donna Sundre
Assessment practitioners in higher education face increasing demands to collect assessment and accountability data to make important inferences about student learning and institutional quality. The validity of these high-stakes decisions is jeopardized, particularly in low-stakes testing contexts, when examinees do not expend sufficient effort to perform well on the test. This study introduced planned missingness as a potential solution. In planned missingness designs, data on all items are collected but each examinee only completes a subset of items, thus increasing data collection efficiency, reducing examinee burden, and potentially increasing data quality. The current scientific reasoning test served as the Long Form test design. Six Short Forms were created to serve as the planned missingness design, which incorporated 50% missing data. Examinees mid-way through their educational career were randomly assigned to complete the test as either a planned missingness or full-form design. Multiple imputation was used to estimate parameters for both conditions. When compared to the full-form design, the planned missingness design resulted in higher group mean test performance, higher self-reported examinee motivation, and a reduction in shared variance between test-taking effort and test performance. Internal consistency coefficients and item parameter estimates were similar between the form conditions. Although the effect sizes were small for some comparisons, the implications of these results for assessment practice are substantive. This study supported the use of planned missingness designs for accurate estimation of student learning outcomes without jeopardizing psychometric quality. The synthesis of the planned missingness design and examinee motivation literatures provides several opportunities for new research to improve future assessment practice.
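As a rough illustration of how six short forms with 50% planned missingness could be laid out, the sketch below splits a test into four item blocks and builds one form from every pair of blocks. This construction is an assumption made for illustration only: the abstract does not describe how the Short Forms were assembled, and the 40-item test length, the function names, and the seed are hypothetical.

```python
import random
from itertools import combinations


def build_short_forms(item_ids, n_blocks=4, blocks_per_form=2):
    """Split items into blocks and build one form per combination of blocks.

    With 4 blocks taken 2 at a time there are 6 forms, and each form
    contains half of the items (that is, 50% planned missing data).
    """
    # Deal items into blocks round-robin so the blocks are equal in size.
    blocks = [item_ids[b::n_blocks] for b in range(n_blocks)]
    forms = []
    for chosen in combinations(range(n_blocks), blocks_per_form):
        forms.append(sorted(item for b in chosen for item in blocks[b]))
    return forms


def assign_forms(examinee_ids, n_forms, seed=0):
    """Randomly assign each examinee to one of the short forms."""
    rng = random.Random(seed)
    return {examinee: rng.randrange(n_forms) for examinee in examinee_ids}


items = list(range(1, 41))  # hypothetical 40-item long form
forms = build_short_forms(items)
assignment = assign_forms(["s1", "s2", "s3", "s4", "s5"], n_forms=len(forms))

for number, form in enumerate(forms):
    print(f"Short Form {number}: {len(form)} of {len(items)} items")
print(assignment)
```

Under this layout every item appears on three of the six forms and every pair of items appears together on at least one form, so each item covariance is observed for some examinees before multiple imputation is applied.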
Extending an IRT Mixture Model to Detect Random Responders on Non-Cognitive Polytomously Scored Assessments
by Mandalyn R. Swanson, Ph.D. (2015)
Advisor: Dr. Dena A. Pastor
This study represents an attempt to distinguish two classes of examinees – random responders and valid responders – on non-cognitive assessments in low-stakes testing. The majority of existing literature regarding the detection of random responders in low-stakes settings exists in regard to cognitive tests that are dichotomously scored. However, evidence suggests that random responding occurs on non-cognitive assessments, and as with cognitive measures, the data derived from such measures are used to inform practice. Thus, a threat to test score validity exists if examinees' response selections do not accurately reflect their underlying level on the construct being assessed. As with cognitive tests, using data from measures in which students did not give their best effort could have negative implications for future decisions. Thus, there is a need for a method of detecting random responders on non-cognitive assessments that are polytomously scored.
This dissertation provides an overview of existing techniques for identifying low-motivated or amotivated examinees within low-stakes cognitive testing contexts including motivation filtering, response time effort, and item response theory mixture modeling, with particular attention paid to an IRT mixture model referred to in this dissertation as the Random Responders model – Graded Response model (RRM-GRM). Two studies, a simulation and an applied study, were conducted to explore the utility of the RRM-GRM for detecting and accounting for random responders on non-cognitive instruments in low-stakes testing settings. The findings from the simulation study show considerable bias and RMSE in parameter estimates and bias in theta estimates when the proportion of random responders is greater than. Use of the RRM-GRM with the same data sets provides parameter estimates with minimal to no bias and RMSE and theta estimates that are essentially bias-free. The applied study demonstrated that when fitting the RRM-GRM to authentic data, 5.6% of the responders were identified as random responders. Respondents classified as random responders were found to have higher odds of being male and of having lower scores on importance of the test, as well as lower average total scores on the UMUM-15 measure used in the study. Limitations of the RRM-GRM technique are discussed.
by Laura M. Williams, Ph.D. (2015)
Advisor: Dr. Donna Sundre
Questions regarding the quality of education, both in K-12 systems and higher education, are common. Methods for measuring quality in education have been developed in the past decades, with value-added estimates emerging as one of the most well-known methods. Value-added methods purport to indicate how much students learn over time as a result of their attendance at a particular school. Controversy has surrounded the algorithms used to generate value-added estimates as well as the uses of the estimates to make decisions about school and teacher quality. In higher education, most institutions used cross-sectional rather than longitudinal data to estimate value-added. In addition, much of the data used to generate value-added estimates in higher education were gathered in low-stakes testing sessions. In low-stakes contexts, examinee motivation has been shown to impact test performance. Additionally, recent empirical evidence indicated that the change in test-taking motivation between pre- and post-test was a predictor of change in performance. Because of this, researchers have suggested that test-taking motivation may bias value-added estimates. Further, if interest truly lies in measuring student learning over time, the use of cross-sectional data is problematic, since the pre- and post-test data are gathered from two different groups of students, not the same students at two time points. The current study investigated two overarching questions related to value-added estimation in higher education: 1) are different methods of value-added estimation comparable?; and 2) how does test-taking motivation impact value-added estimates? In this study, first the results from value-added estimates calculated with cross-sectional and longitudinal data were compared. Next, estimates generated from two value-added models were compared: raw difference scores and a longitudinal hierarchical linear model. Finally, estimates were compared when motivation variables were included.
Results indicated that at the institution under study, cross-sectional and longitudinal data and analyses yielded similar results and that changes in test-taking motivation between pre- and post-test did impact value-added estimates. Suggestions to combat the effect of motivation on value-added estimates included behavioral as well as statistical interventions.
by Rory Lazowski, Ph.D. (2015)
Advisor: Dr. Dena A. Pastor
This dissertation comprises two separate papers, both of which draw from a recent meta-analysis conducted by Lazowski and Hulleman (2015) but in distinct ways. The first paper is a more technical, methodological treatment of meta-analysis that is presented as a tutorial using illustrations based on data from this meta-analysis throughout. Meta-analysis is often lauded as an effective analytic tool to inform practice and policy, disentangle conflicting results among single studies, and identify areas that require additional information for a certain topic. However, because routine use of meta-analysis is relatively recent, there remain methodological issues that require clarity. The first paper is intended to be a tutorial to examine some of the methodological issues associated with meta-analysis. More specifically, the tutorial examines the concept of effect size use in meta-analysis, the choice of analytic technique (fixed versus random effects models using traditional approaches), comparisons of traditional approaches to a multilevel modeling approach, publication bias, and best practices related to the inclusion of published and unpublished literature in meta-analyses.
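To make the fixed versus random effects distinction concrete, the following minimal sketch pools a handful of effect sizes both ways. It assumes inverse-variance weighting and the DerSimonian-Laird estimator of between-study variance, which is one common traditional approach; the abstract does not say which estimator the tutorial uses, and the effect sizes and sampling variances below are invented for illustration.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean effect and the variance of that mean."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    return mean, 1.0 / total


def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of the between-study variance (tau squared)."""
    weights = [1.0 / v for v in variances]
    mean, _ = fixed_effect(effects, variances)
    # Q statistic: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    df = len(effects) - 1
    return max(0.0, (q - df) / c)  # truncate negative estimates at zero


def random_effects(effects, variances):
    """Pooled mean after adding tau squared to each study's sampling variance."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    mean, var = fixed_effect(effects, [v + tau2 for v in variances])
    return mean, var, tau2


# Invented standardized mean differences and sampling variances.
effects = [0.10, 0.50, 0.90]
variances = [0.01, 0.01, 0.01]

print("fixed: ", fixed_effect(effects, variances))
print("random:", random_effects(effects, variances))
```

When the studies disagree by more than sampling error alone would predict, the estimated between-study variance is positive and the random effects model reports a wider uncertainty around the pooled mean than the fixed effect model does.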
Next, intervention studies are a particularly important and valuable facet of educational research. The second paper examines how intervention work can be used to help inform theory, research, and policy/practice in a multitude of ways. However, despite these benefits, intervention research in the field of education has been on the decline over the past two decades (Hsieh et al., 2005; Robinson et al., 2007). The field of academic motivation research is no different. Although formal meta-analytic techniques can provide a quantitative analysis that can be useful in summarizing interventions, a narrative review can offer qualitative insight that can complement the quantitative analyses found via meta-analysis. Toward this end, I offer a more thorough narrative review of the studies included in the Lazowski and Hulleman (2015) meta-analysis. Given the conceptual overlap among the theories and constructs therein, the expectancy-value framework is proposed as a means to organize the various intervention studies. In addition, within the general categories of expectancies, values, and cost, I identify specific sources or pathways of expectancies, values, and cost that can be targeted by interventions. These sources or pathways refer to the underlying psychological processes that both serve as antecedents and that are potentially amenable to intervention by educational practitioners (Hulleman et al., in press).