ORIGINAL ARTICLE
Year : 2016  |  Volume : 9  |  Issue : 2  |  Page : 204-208  

Assessing the assessors: Use of statistical tests to find out the inter examiner reliability of examiners in a post graduate medical examination


Department of Community Medicine, Armed Forces Medical College, Pune, Maharashtra, India

Date of Web Publication: 1-Mar-2016

Correspondence Address:
Puja Dudeja
Department of Community Medicine, Armed Forces Medical College, Pune - 411 040, Maharashtra
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/0975-2870.177666

  Abstract 

Context: A robust evaluation system is the backbone of any education system. There are different methods of evaluating answer sheets in a subjective exam, such as the glance and grade method, a scoring pattern system, and training the examiners on model answers before evaluation. However, inter-examiner reliability varies with each of these methods. Aim: The present study was done to find out the inter-examiner reliability in a postgraduate (PG) exam assessed with the glance and grade method. Materials and Methods: This was a cross-sectional study conducted during a PG examination in community medicine in a medical college setting. Four independent assessors rated nine PG students over 36 different items in four theory exams. The examiners were blinded during evaluation of the answer sheets. Statistical Analysis Used: Reliability statistics, inter- and intra-class correlation coefficients, Pearson's correlation coefficient, and Spearman's rho coefficient between assessors were calculated. Results: There was a significant difference in the mean scores (P < 0.001) of different assessors. However, there was significant reliability and correlation between the assessors (P < 0.05). Pearson's correlation coefficient varied between 0.736 and 0.893, and Spearman's rank coefficient varied between 0.741 and 0.891; this could be due to the fact that the assessors were experienced teachers. Conclusions: The glance and grade method of assessing PGs has good reliability between the assessors. However, the scores can vary significantly between different assessors.

Keywords: Correlation, evaluation, examiners, reliability


How to cite this article:
Dudeja P, Grewal VS, Mukherji S. Assessing the assessors: Use of statistical tests to find out the inter examiner reliability of examiners in a post graduate medical examination. Med J DY Patil Univ 2016;9:204-8

How to cite this URL:
Dudeja P, Grewal VS, Mukherji S. Assessing the assessors: Use of statistical tests to find out the inter examiner reliability of examiners in a post graduate medical examination. Med J DY Patil Univ [serial online] 2016 [cited 2024 Mar 28];9:204-8. Available from: https://journals.lww.com/mjdy/pages/default.aspx/text.asp?2016/9/2/204/177666


Introduction


Evaluation is an integral component of any education system. [1] An ideal education system is inseparable from its evaluation system. A correct evaluation methodology develops a deep learning approach and encourages a problem-solving attitude and cognitive skills among students, and only a good evaluation system can determine the extent to which predetermined educational objectives have been achieved. When it comes to medical education, evaluation is of great relevance because here lies the responsibility of bringing out the best doctor among the best. However, this is credible only when the principles of fairness, validity, reliability, and practicability are met. [2]

The evaluation system in community medicine varies remarkably between the undergraduate (UG) and postgraduate (PG) levels. [3] It further varies between PG institutes such as the All India Institute of Medical Sciences, New Delhi; the Postgraduate Institute of Medical Education and Research, Chandigarh; the Armed Forces Medical College, Pune; the All India Institute of Hygiene and Public Health, Kolkata; and colleges under different universities of health sciences. [4],[5],[6] However, whatever the pattern of training, the evaluation system at the end should be a true reflection of the candidates' abilities and hence beneficial for learning. [7]

The broad heads for examination in PG community medicine have been a theory exam and a practical exam, both of equal weightage. It has long been known that essay-type questions can give a misleading impression of students' real abilities. Objective assessment methods such as multiple choice questions (MCQs) have been rated as more reliable because they eliminate examiner bias. [8] The subjectivity of the theory examination has been partly replaced by objective MCQs at the UG level, but these have faced the criticism of testing only memory recall of independent facts rather than the application of knowledge and hence have not yet found a place in PG evaluation in community medicine. [9]

The subjective part of theory answer sheets is assessed against a blueprint or standard/model answer at the UG level. A system of developing specific criteria and checklists in the marking scheme does not exist at the PG level. The reason could be the pattern of the question paper, which comprises long questions that the examinee is required to answer in great detail. For assessment of PG answer sheets, it is difficult to have a standard answer, as there can be variation in describing the practical application of concepts, writing critiques, handling public health issues, etc. Without a standard answer, reliability becomes a serious issue when different examiners evaluate the same students on the same answers, and this can be a source of assessment error. Without a standard blueprint for an answer, there can be not only significant disagreement between examiners but also wide intra-examiner variation when the same rater evaluates the same answer on a second occasion.

Some researchers have found that development of an analytical approach using detailed checklists improves examiner reliability. [10],[11] However, others have reported no difference between the glance and grade and checklist methods of assessment. [6] There is a paucity of research in this area of evaluation. The reliability of any overall evaluation can be increased by using more than one rater rather than the traditional single rater. With this background, the present study was conducted to evaluate inter-examiner reliability in a PG examination using four examiners.


Materials and Methods


A course-ending examination for PG students was conducted at the end of 3 years of training in community medicine. The exam comprised four theory papers of 100 marks each, held on four separate days. Each paper comprised both long and short questions, which students attempted over 3 h. A total of 9 PG students took this exam.

The evaluation of answer sheets for all four papers was done by four examiners separately. In other words, all answer sheets were checked by all four examiners. The sheets were coded to prevent bias. Each examiner was blinded to marks awarded by other examiners.

All the examiners who carried out the evaluation had more than 10 years of teaching experience in the specialty. An average of all four mark sheets was taken as the final assessment score of the candidate. The results were entered in Microsoft Excel 2007, and data were analyzed using SPSS 20 software (IBM SPSS Statistics). Pearson's correlation coefficient and Spearman's rank coefficient were used to assess the correlation between different assessors. Reliability was calculated using Cronbach's alpha, and intra-class correlation coefficients were computed using a mixed model. There were no ethical issues in the study.
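As an illustration of the analysis described above (a sketch only; the study data were analyzed in SPSS, and the score values below are hypothetical), the pairwise correlations and Cronbach's alpha could be computed for a students-by-assessors score matrix as follows:

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical score matrix: rows = the nine candidates, columns = the four assessors.
# The study itself used SPSS 20; these values are illustrative only.
scores = pd.DataFrame(
    {
        "A1": [262, 241, 255, 230, 270, 218, 249, 236, 260],
        "A2": [250, 238, 248, 221, 265, 210, 244, 230, 252],
        "A3": [268, 250, 262, 240, 278, 225, 255, 242, 266],
        "A4": [255, 236, 250, 228, 268, 215, 246, 233, 257],
    },
    index=[f"student_{i}" for i in range(1, 10)],
)

# Pairwise Pearson and Spearman correlations between assessors (cf. Tables 3 and 4).
for a, b in combinations(scores.columns, 2):
    r, p_r = stats.pearsonr(scores[a], scores[b])
    rho, p_rho = stats.spearmanr(scores[a], scores[b])
    print(f"{a}-{b}: Pearson r = {r:.3f} (P = {p_r:.3f}), "
          f"Spearman rho = {rho:.3f} (P = {p_rho:.3f})")

# Cronbach's alpha, treating the four assessors as "items" rating the same candidates:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
def cronbach_alpha(df: pd.DataFrame) -> float:
    k = df.shape[1]
    item_variances = df.var(axis=0, ddof=1)
    total_variance = df.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```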


Results


A total of nine PG residents undertook the exam on completion of 3 years of training in community medicine. The mean, maximum, and minimum marks attained are given in [Table 1]. One (11%) student failed the exam. The difference in marks from the average ranged from 22% to 19%. The mean PG teaching experience of the assessors was 28 years. Two (50%) of them were external examiners, and the remaining two were internal examiners. There was a statistically significant difference in the overall scores allotted by the four assessors (P < 0.05) [Table 2]. Nevertheless, there was a significant correlation between the assessors (P < 0.05). Pearson's correlation coefficient varied between 0.736 and 0.893 [Table 3], and Spearman's rank coefficient varied between 0.741 and 0.891 [Table 4]. The reliability statistics gave a Cronbach's alpha of 0.936. Intra-class correlation coefficients were highly significant [Table 5]. The pattern of scores of all assessors is given in [Figure 1], showing consistency between all examiners.
Figure 1: Individual assessor scores
Table 1: Scores of subjects in theory examination
Table 2: Comparison of mean scores of assessors using one-way ANOVA
Table 3: Pearson's correlation coefficient between assessors
Table 4: Spearman's rho coefficient between assessors
Table 5: Intra-class correlation coefficient
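The following is a minimal sketch of how the group comparison in [Table 2] and the intra-class correlation coefficients in [Table 5] might be reproduced, again on hypothetical data; the scipy and pingouin packages are used here as one convenient option and are not the SPSS procedures actually employed in the study.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical students-by-assessors score matrix, analogous to the Methods sketch:
# a shared "ability" term per candidate plus a systematic offset per assessor.
rng = np.random.default_rng(0)
ability = rng.normal(250, 15, size=9)
scores = pd.DataFrame(
    {f"A{j}": ability + 3 * j + rng.normal(0, 5, size=9) for j in range(1, 5)},
    index=[f"student_{i}" for i in range(1, 10)],
)

# One-way ANOVA comparing the mean scores awarded by the four assessors (cf. Table 2).
f_stat, p_value = stats.f_oneway(*(scores[col] for col in scores.columns))
print(f"One-way ANOVA: F = {f_stat:.2f}, P = {p_value:.4f}")

# Intra-class correlation coefficients (cf. Table 5), computed from the data in long
# format (one row per student-assessor pair) using pingouin's intraclass_corr helper.
long_scores = (
    scores.reset_index()
    .rename(columns={"index": "student"})
    .melt(id_vars="student", var_name="assessor", value_name="score")
)
icc = pg.intraclass_corr(data=long_scores, targets="student",
                         raters="assessor", ratings="score")
print(icc[["Type", "ICC", "pval", "CI95%"]])
```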



Discussion


Evaluation is the pivot of the educational system. There is a strong belief that anyone who practices medicine can teach and can therefore also evaluate. This assumption is not always true. In the conduct of any exam, especially a PG medical examination, the evaluator should have appropriate and relevant training in medical education and assessment.

In our study, the evaluators, though not formally trained in medical assessment, had vast experience in teaching and in the conduct of PG examinations. This is also reflected in the results, which, although they showed a significant difference in scores between assessors, displayed high reliability between them. Even so, it is strongly recommended that formal teacher training in conducting examinations and assessing answer sheets be treated as a critical input into a PG training program. Medical education units established in medical colleges are partially filling this gap in teacher training. However, teachers in medical institutions should also take the onus of periodically learning and practicing new and validated techniques of medical evaluation. Some noteworthy opportunities include participation in common assessment programs run by universities after examinations. Such opportunities should increasingly be made available to the guides/teachers of PG students by the administration. In the long run, they may also be made compulsory.

Our results showed a significant correlation between the various examiners despite the absence of an existing checklist/model answer. This may be attributed to the evaluators' vast experience in PG training. Nevertheless, an analytical system with checklist criteria is a value addition when the examiners are of varied experience. A checklist/model answer can serve as a gold standard that is shown to the examiners before evaluation of a subjective examination. Another method to improve reliability between assessors is to hold an orientation before the assessment program begins. This has been recommended in various dental sciences institutes. [12],[13]

A reliable evaluation pattern in any exam is important, as it affects the confidence and performance of the students. Thus, the problem of inter-examiner reliability and variability can play a negative role in the future performance of the students. Further research and improvement in this area is the need of the hour.

In many teaching institutions, the glance-and-grade method is applied, especially by more experienced faculty, because of practicalities. However, it remains a good assessment tool only in the hands of highly experienced faculty. If it is used by less experienced and junior teachers, inter-examiner reliability can be highly compromised. In such situations, apart from the checklist method, another system to reduce variability is to use cut-off scores with percentages and a grading system. As noted, the purpose of this study was to investigate inter-examiner variability. The resulting scores are presumably an accurate reflection of student performance levels. However, a number of other situational factors could also have influenced these scores, so they may not be a true picture of a student's actual level; for example, the marking pattern of some examiners is very lenient while that of others is strict. Inter-class correlation is a composite measure of both intra-class correlation and inter-rater reliability. A high inter-rater reliability is essential if the assessment is used for pass/fail decisions or for deciding the positional merit of students and the benefits received thereof.

In the present study, the good results for Pearson's and Spearman's correlation coefficients confirm low variability and high reliability. By allowing all assessors' grading to be tested simultaneously with Cronbach's alpha, we were able to reduce selection bias. The internal consistency reliability showed a remarkably high value, indicating strong consistency between examiners.

The strength of this study is the demonstration of the reliability of experienced examiners without the use of a checklist. Nevertheless, limitations exist. Most importantly, only a small group of students took the exam. The assessors were selected from active, interested, volunteer staff, which could also be a source of bias. However, the four assessors' individual involvement in teaching, their experience, and their knowledge of medical education varied considerably, thereby making them fairly representative of all teachers. A teacher's performance as an assessor is often taken for granted, and his or her competence as an examiner is never tested.


Conclusion and Recommendations


The glance and grade method of assessing PGs demonstrated good reliability between the assessors in the present study. However, the scores varied significantly between different assessors. Medical education units established in medical colleges focus on evaluation by teachers, but their main focus is UG teaching. Proper formal, structured training for PG assessment still remains a lacuna. Clinical assessment tools such as the objective structured clinical examination are already well established, but objective assessment of theory in PG evaluation is not well understood. It is recommended that a structured training module for objectively assessing long theory answers at the PG level be evolved.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

 
References

1. Schiekirka S, Reinhardt D, Heim S, Fabry G, Pukrop T, Anders S, et al. Student perceptions of evaluation in undergraduate medical education: A qualitative study from one medical school. BMC Med Educ 2012;12:45.
2. Dippenaar H, Steinberg WJ. Evaluation of clinical medicine in the final postgraduate examinations in family medicine. S Afr Fam Pract 2008;50:67.
3. Medical Council of India. Postgraduate Medical Education Regulations. New Delhi: Medical Council of India; 2000.
4. Lal S, Kumar R, Prinja S, Singh GP. Post graduate teaching and evaluation in community medicine. Indian J Prev Soc Med 2011;42:221-4.
5. Baba Farid University of Health Sciences. Syllabus for MD in Community Medicine; 2007. p. 214-37.
6. PGIMER. Syllabus for MD in Community Medicine. Chandigarh: Postgraduate Institute of Medical Education and Research; 2008.
7. Meadows M, Billington L. A Review of the Literature on Marking Reliability. Report for the National Assessment Agency by AQA Centre for Education Research and Policy; 2005.
8. Vanderbilt AA, Feldman M, Wood IK. Assessment in undergraduate medical education: A review of course exams. Med Educ Online 2013;18:1-5.
9. Bunmi S, Aduli M, Mulcahy S, Warnecke E, Otahal P, Teague PA, et al. Inter-rater reliability: Comparison of checklist and global scoring for OSCEs. Creat Educ 2012;3:937-42.
10. Farooq S. High failure rate in postgraduate medical examinations - Sign of a widespread disease? J Pak Med Assoc 2005;55:214-7.
11. Weatherall DJ. Designing a doctor: Examining undergraduate examinations. Lancet 1991;338:37-9.
12. Abou-Rass M. A clinical evaluation instrument in endodontics. J Dent Educ 1973;37:22-36.
13. Dahlström L, Keeling SD, Fricton JR, Galloway Hilsenbeck S, Clark GM, Rugh JD. Evaluation of a training program intended to calibrate examiners of temporomandibular disorders. Acta Odontol Scand 1994;52:250-4.

