EXAMINING THE ADEQUACY OF AN EXIT EXAM TO MEASURE DIPLOMA STUDENTS' ACHIEVEMENT: RASCH ANALYSIS

Purpose of the study: This paper examines the adequacy of an exit exam using the Rasch Model. It also addresses the students' achievement on the exam items according to Learning Outcomes (LOs), i.e., which LOs have and have not been achieved. Main Findings: The Rasch analyses showed that there were issues related to the adequacy of the exit exam in terms of the items' validity and the items' distribution along the interval scale. The items' qualitative investigation revealed that the stems and options of some items have problems. Overall, the exam was easy for the students, and students' achievement varied across the Learning Outcomes (LOs). These findings highlight the importance of using measurement models, of which the Rasch Model is an example, to validate exams and to provide a more accurate interpretation of students' achievement. Methodology: A descriptive quantitative research design was utilized to achieve the research objectives. An exam comprising 100 multiple-choice items was administered to 322 students taking the Professional Diploma in Teaching at a College of Education. The items cover eight (8) Learning Outcomes that students were expected to achieve upon completing all the Professional Diploma courses. The collected data were analyzed using the Rasch Model for dichotomous data and Winsteps software 4.1.0 (2018). Applications of this study: The study provides insightful information to higher education institutions in general, and to colleges of education in particular, to revamp the implementation of diploma teaching programs, mainly the assessment methods. Novelty/Originality of this study: This paper extends the evidence for providing academic staff at higher education institutions with the necessary information and training on measurement to make more informed decisions.


INTRODUCTION
Measurement and evaluation are key components of the whole teaching and learning process as they provide information related to students' learning progress or performance (Worthen, White, Fan & Sudweeks, 1999). In most academic institutions, tests are the most common instruments used to measure students' performance and then make decisions based on their test scores. Such tests have increasingly been criticized due to shortcomings in their appropriateness in terms of preparation, selection, administration, and interpretation of the results (Worthen et al., 1999). They further asserted that there should be "some structured, reliable way to measure student performance" to ensure that students are being taught effectively. In other words, examiners should use a measurement model that helps them ensure the test's appropriateness and provide more accurate and reliable interpretations of the test results in a practical way. Linacre (2003) elaborated that "the more generally applicable the model, and the more useable the results, the more it is likely to meet practical needs and form the basis for scientific progress" (p. 907). This was previously highlighted by Wright (1997), who mentioned that if our decisions were based on untrustworthy measures and divergent units, then the decisions would be inaccurate. For instance, using raw test scores to determine students' performance in a specific subject is not enough, as they do not reflect the intended results and provide spurious or "misleading information and distortion" (Lee, 2002; Wright, 1993a; Wright, 1999; Wright & Linacre, 1997).
The Rasch Measurement Model, named after Georg Rasch, a Danish mathematician, helps obtain more accurate and reliable measurements of students' abilities or performances (Bond & Fox, 2015; Engelhard, 2000; Linacre, 2003; Wright & Stone, 1979). The model is used for assessment in psychology, education, health, and physical science. In principle, it attributes the likelihood of getting a correct answer to a particular item to the difference between person ability and item difficulty; that is, the probability of a correct answer is governed jointly by item difficulty and person ability. Two propositions underlie the theoretical concept of the Rasch Measurement Model (Bond & Fox, 2015). First, more able examinees are more likely to answer any item correctly. Second, all examinees are more likely to answer easier items correctly. The probabilistic dichotomous model is expressed by the following formula, where \(B_n\) is the ability of person \(n\) and \(D_i\) is the difficulty of item \(i\):

\[
P(X_{ni} = 1) = \frac{e^{(B_n - D_i)}}{1 + e^{(B_n - D_i)}}
\]

Fit statistics are investigated to ensure the items contribute meaningfully to the measured construct. The two major fit statistics, the infit and outfit mean-square statistics, were used (Bond & Fox, 2015; Boone, 2016; Green & Frantom, 2002). The recommended range for multiple-choice items is 0.7-1.3 (Bond & Fox, 2015). Table A1 shows the infit and outfit mean squares of the individual items. All items were within the recommended infit and outfit mean-square range (0.7-1.3), except two items (PR98 and APP43) whose outfit mean squares were above 1.3. The two items had issues in their writing, as discussed earlier. The mean infit mean square was 1.00 and the mean outfit mean square was .99, almost exactly the expected value of 1.00. However, the standard error of measurement for the individual items ranged between .12 and .32 logits, indicating that some items might not function effectively (see Appendix A1).
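To make the model and the fit statistics concrete, the following is a minimal sketch in Python with simulated measures, not the authors' Winsteps analysis; all person and item values are hypothetical. It computes the Rasch probability above and the residual-based infit (information-weighted) and outfit (unweighted) mean squares for items.

```python
import numpy as np

def rasch_prob(ability, difficulty):
    """P(correct) under the dichotomous Rasch model: e^(B-D) / (1 + e^(B-D))."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

rng = np.random.default_rng(0)

# Hypothetical logit measures: 200 persons, 10 items (not the study's data)
B = rng.normal(0.0, 1.0, 200)          # person abilities B_n
D = np.linspace(-2.0, 2.0, 10)         # item difficulties D_i

P = rasch_prob(B[:, None], D[None, :])          # model-expected scores
X = (rng.random(P.shape) < P).astype(float)     # simulated 0/1 responses

# Residual-based fit statistics for items
V = P * (1.0 - P)                               # model variance of each response
outfit = ((X - P) ** 2 / V).mean(axis=0)                # unweighted mean square
infit = ((X - P) ** 2).sum(axis=0) / V.sum(axis=0)      # information-weighted mean square

print(np.round(infit, 2))
print(np.round(outfit, 2))
```

With well-fitting data, both statistics hover around the expected value of 1.00, which is why the study treats values outside 0.7-1.3 as misfits.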
Unidimensionality denotes that the items of a given test measure a single construct, and it is assessed using principal component analysis of the residuals. Table A1 demonstrates that unidimensionality is supported, although the raw variance explained by the measures was low (25.7%). There is no secondary dimension, since all the factors in the first and second contrasts were less than 5%. Moreover, the largest factor extracted from the residuals had an eigenvalue of 2.67.
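As a hedged illustration of this dimensionality check (a sketch with simulated unidimensional data, not a reproduction of the Winsteps procedure, which works from estimated rather than true parameters), the principal components are extracted from the standardized residuals; a first contrast with an eigenvalue well above roughly 2 would hint at a secondary dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unidimensional data: 300 persons, 20 items (not the study's data)
B = rng.normal(0.0, 1.0, 300)
D = np.linspace(-2.0, 2.0, 20)
P = 1.0 / (1.0 + np.exp(-(B[:, None] - D[None, :])))
X = (rng.random(P.shape) < P).astype(float)

# Standardized residuals: what is left after the Rasch dimension is removed
Z = (X - P) / np.sqrt(P * (1.0 - P))

# Eigenvalues of the residual correlations, largest first
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
print(np.round(eigenvalues[:3], 2))
```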
The reliability of the item difficulty measures was quite high (0.99), as seen in Table A1, which indicates that the ordering of item difficulty could be replicated with similar groups of students. The item separation index was 8.34, indicating that the items can be divided into at least 8 difficulty levels, which is satisfactory for 100 items. However, the distribution of the items on the map showed two gaps at the top and bottom parts of the scale, and some clusters of items appeared in the middle (Figure 2). The qualitative investigation showed that there were issues with some items that prevented the majority of students from getting the correct answers. Table A1 also shows that the reliability of the examinees' ability measures was not high (0.77), which suggests that the ordering of students would not be highly replicable with other items of the same difficulty. The examinees' separation index was 1.85, showing that the examinees could be split into two levels of ability. In general, the examinees were not answering as the model expected, as supported by the high value (.24 logits) of the measurement standard error (Figure 3). Figure 2 shows that most of the visible gaps in the items' distribution were not significant. However, the upper and lower ends of the scale showed two wide gaps, with the most difficult items at the top and the easiest ones at the bottom. Most of the items accumulated around the mean (i.e., in the middle of the scale), which suggests either that the items were not discriminating among the examinees effectively or that the examinees had a narrow ability range. Qualitative investigation showed that the most difficult items, placed at the upper part of the scale, had issues in the stem and the options, which prevented the majority of students from getting the correct answers. The clustered items in the middle should be investigated to see whether they measure almost the same things. Figure 2 clearly shows the item difficulty measures ordered from the most difficult items, PR98 (3.56 logits) and PR94 (3.09 logits), to the easiest items, EV69 (-2.90 logits) and PR96 (-2.47 logits).

Figure 2: Item-Map
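The separation indices and reliabilities reported above are linked by the standard Rasch relationship \(R = G^2/(1+G^2)\), where \(G\) is the separation index. A quick verification with the values reported in this study:

\[
R_{\text{items}} = \frac{8.34^2}{1+8.34^2} \approx 0.99, \qquad
R_{\text{persons}} = \frac{1.85^2}{1+1.85^2} \approx 0.77
\]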
The table in Appendix A2 shows the fit statistics of the examinees' responses. The infit MNSQ value was 1.00, the value expected by the model, and the outfit MNSQ (0.99) was also close to the expected value. However, the standard error was 0.24 logits. Eight students were found to misfit, as their outfit MNSQ was above the recommended range (.7-1.3). It seems that students were not responding to the items as the Rasch Model expected, as depicted in Figure 3. Proposed reasons for the high misfit statistics are lucky guessing by low achievers and the issues found in some items' options and stems. In general, the results showed that the test might not be adequate for describing the examinees' achievement: there are good items, while many others need further qualitative investigation.
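As a small illustration of how such misfitting examinees can be flagged (the outfit values below are purely hypothetical, not the study's Winsteps output):

```python
import numpy as np

# Hypothetical person outfit mean squares (illustrative only; the study
# reports eight misfitting examinees out of 322)
outfit = np.array([0.85, 1.02, 1.45, 0.97, 1.31, 0.76, 1.08, 1.62])

# Flag examinees outside the recommended mean-square range (0.7-1.3)
misfits = np.flatnonzero((outfit < 0.7) | (outfit > 1.3))
print("Misfitting examinees (0-indexed):", misfits)   # -> [2 4 7]
```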

STUDENTS' ACHIEVEMENT LEVELS
The Rasch analyses were conducted to determine the students' achievement levels on the exam items, both overall and according to each learning outcome that students were expected to achieve once they completed the courses taught in the Professional Diploma in Teaching at a College of Education. Rasch item and person maps display the positions of items and persons on the same interval scale. They help to identify which learning outcomes have been achieved and which have not. In other words, they help to determine how much students have acquired from the courses taken in the program, and in which learning outcomes they showed higher and lower achievement. The maps can also show the most able students placed at the upper part of the scale and the least able students placed toward the lower part.
On average, the students' ability as a group was higher than the item difficulty. The students found the exam easy: the mean of their ability measures was 0.67 logits, considerably higher than the mean of the item difficulty measures (0.0 logits) (Figure 4). The map shows that the items most often answered correctly by the examinees are placed towards the lower part of the scale, while the least correctly answered are positioned towards the upper part. Moreover, the examinee ability measures spanned about 3.19 logits (from -.99 to +2.20), while the item difficulty measures spread over about 6.46 logits (from -2.90 to +3.56). Figure 4 also shows that most students were distributed between -.5 and +1 logits and accumulated around the middle of the scale, which means that they had a rather narrow range of ability.

Figure 4: Examinee Ability and Item Difficulty Map
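A person-item (Wright) map of this kind can be approximated with generic plotting tools. The sketch below uses simulated measures that merely echo the reported ranges (not the study's data) and assumes matplotlib is available; it places person abilities and item difficulties side by side on the same logit scale.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated measures roughly matching the reported ranges (not the study's data)
persons = rng.normal(0.67, 0.55, 322)   # person abilities, mean ~0.67 logits
items = rng.uniform(-2.90, 3.56, 100)   # item difficulties within reported span

fig, (ax_p, ax_i) = plt.subplots(1, 2, sharey=True, figsize=(6, 6))
ax_p.hist(persons, bins=30, orientation="horizontal", color="steelblue")
ax_p.set(title="Persons", ylabel="Logit scale")
ax_p.invert_xaxis()                      # mirror so the two panels face each other
ax_i.hist(items, bins=30, orientation="horizontal", color="darkorange")
ax_i.set(title="Items")
fig.tight_layout()
plt.show()
```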
Though overall it was easy for the students to answer the exam items, it is essential to highlight that students scored differently across the learning outcomes they were expected to achieve by the time they had finished all the diploma courses. Figure 5 shows the mean of each learning outcome's items as well as the distribution and hierarchical order of the items. The most difficult learning outcome for students was "Demonstrate professional responsibility towards their students, school, and society" (PR) (M = 0.52 logits), followed by the learning outcome "Plan and design an effective student-…". Within each learning outcome, the spread of items means that the students had achieved certain skills and not others: they were not able to answer the questions placed at the top of the scale correctly, while it was easy for them to answer the questions at the bottom correctly, as displayed in Figure 5.
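The per-outcome means reported here amount to grouping item measures by learning-outcome code and averaging. A minimal sketch of that computation follows; it reuses the four item measures quoted in the text, while the APP43 value is purely illustrative and the result is not a reproduction of the study's Figure 5.

```python
import pandas as pd

# Item measures tagged with learning-outcome codes; APP43's measure is invented
items = pd.DataFrame({
    "item":    ["PR98", "PR94", "PR96", "EV69", "APP43"],
    "lo":      ["PR",   "PR",   "PR",   "EV",   "APP"],
    "measure": [3.56,   3.09,  -2.47,  -2.90,   1.40],
})

# Mean item difficulty per learning outcome, ordered hardest to easiest
print(items.groupby("lo")["measure"].mean().sort_values(ascending=False))
```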

CONCLUSION
The Rasch analyses showed that there were a few issues related to the adequacy of the exit exam in terms of the items' validity and the items' distribution along the interval scale. The items' qualitative investigation revealed that the stems and options of some items have problems. Overall, the students found the exam easy: the mean person ability (0.67 logits) was greater than the mean item difficulty (0.0 logits). However, the majority of the students were gathered in the middle of the scale, showing that they might have a narrow range of ability. Students' achievement varied across the learning outcomes they were expected to achieve once they completed the courses of the Professional Diploma in Teaching. This means that they would graduate without mastering certain skills. In principle, the analysis showed the need for a measurement model to validate the items and show how much students have achieved during their study.

LIMITATION AND STUDY FORWARD
It is recommended that the existing items be empirically examined before being given to students, to ensure the requirements of accurate measurement are met, and that academic staff and exam writers be given sufficient training or guidelines on how to prepare and construct accurate and appropriate measurement instruments. The Rasch maps could help the college see what students can and cannot do because the maps display students and items on the same interval scale. Some of the good items could be added to future exit exams for further analysis, to ensure comparable exams are obtained, as recommended by Wright (1993b). This research has its own limitations. It focused only on the adequacy of the exit exam and students' achievement using the Rasch Model; it did not determine which groups of students performed high or low on the exam, nor did it examine the factors that might affect students' performance on the exam items, such as item format, allocated time, and the number of items.

ACKNOWLEDGEMENT
The researchers would like to thank the management of the College of Education where the research was conducted for their cooperation and support given to complete the research.

AUTHORS' CONTRIBUTION
The main Author, Dr. Enas Said Abulibdeh, dealt with the conceptual design, data collection, and preparation of the manuscript. Data analysis, interpretation, and presentation of reports, preparation of the manuscript, and preparation of the final draft have been done by Dr. Kamal J I Badrasawi. Data analysis, interpretation, and presentation of reports and preparation of the final draft have been done by Prof. Noor Lide.