#### A Guide to Objective Scrutiny of a Test: Individual Item Analysis & Overall Test Analysis in Practice

Introduction

The present study examines important characteristics of the individual items of a test, i.e., Item Facility, Item Discrimination, and Choice Distribution, as well as a significant feature of the overall test, i.e., Reliability. To this end, four valid vocabulary tests, each comprising 30 items (120 items in total), were administered to 32 B.A.-level students majoring in English Translation at Shahid Chamran University, a group as similar as possible to the target population for whom the tests are intended.

The 120-item package is divided into four Vocabulary Levels Tests: the 2000 word level, the 3000 word level, Academic vocabulary, and the 5000 word level. To make the students' performance easier to track and analyze, the items are numbered consecutively in that order. The required characteristics of each item are calculated, and Reliability is computed for each test separately.

This try-out (pretesting) involves administering the test in order to collect information about its usefulness and to improve the test and the testing procedures. Specifically, the goal of pretesting is twofold. The first purpose is to determine objectively the characteristics of the individual items (item analysis): item facility (IF), item discrimination (ID), and choice distribution (CD). The second purpose is to determine the Reliability of each test.

Finally, having calculated and analyzed the pertinent statistical values, one can either eliminate the ill-constructed, malfunctioning, and unsuitable items or revise them so as to develop an appropriate test that fulfills academic requirements.

* Choice distribution (CD) [response frequency distribution]

Choice distribution is a technique that shows a test developer how each distractor performs in a given test administration. In simple terms, choice distribution refers to the frequency with which the alternatives are selected by the examinees, i.e., the distribution of responses across the alternatives of a multiple-choice item.

Choice distribution should be determined in order to improve the test both quantitatively and qualitatively. Through it, the test developer can observe deficiencies in the choices and then discard or modify them. For example, if a choice is not selected by any examinee, the distractor is not functioning satisfactorily and should be modified or deleted.
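As a minimal sketch of how such a tally could be produced, the snippet below counts how often each alternative of a single item was chosen and lists the distractors that nobody selected. The function names, the four-choice format, the answer key, and the ten responses are all hypothetical and serve only as an illustration; they are not data from this study.

```python
from collections import Counter

def choice_distribution(responses, alternatives):
    """Count how often each alternative was selected for one item."""
    counts = Counter(responses)
    # Report every alternative, including those nobody chose (count 0).
    return {alt: counts.get(alt, 0) for alt in alternatives}

def unselected_distractors(distribution, key):
    """Return the wrong alternatives that no examinee selected."""
    return [alt for alt, n in distribution.items() if alt != key and n == 0]

# Hypothetical responses of ten examinees to one four-choice item (key: "b").
responses = ["b", "b", "a", "b", "c", "b", "b", "a", "b", "b"]
dist = choice_distribution(responses, alternatives="abcd")
print(dist)                               # {'a': 2, 'b': 7, 'c': 1, 'd': 0}
print(unselected_distractors(dist, "b"))  # ['d']
```

In this made-up case, alternative "d" attracted no examinee, so it would be a candidate for revision.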

Analysis:

• Non-highlighted rows

These are the items answered correctly by all test takers. Because none of their distractors attracted any examinee, the distractors have not functioned well and need to be modified or eliminated.

• Highlighted rows (boxes in white)

These rows contain cells showing how often each alternative was selected by the test takers. Although some of the distractors have worked well (i.e., wrong choices that were selected by some examinees), others have malfunctioned (i.e., were never selected) and need to be modified or deleted.

Choice Distribution tables (not reproduced in this text): the 2000 word level, the 3000 word level, Academic vocabulary, and the 5000 word level.

* Item facility (IF) [facility value, item easiness, item difficulty]

Item facility is a measure of the ease of a test item: how easy or difficult the item is for the group of examinees taking the test of which it is a part. The reason for concern with IF is simple: a test item that is too easy (say, one that every student answers correctly) or too difficult (say, one that every student answers incorrectly) tells us nothing about differences in ability within the test population, so it should be deleted.

The formula for computing IF as a decimal value is:

$$\mathrm{IF} = \frac{\sum C}{N}$$

where $\sum C$ is the number of correct responses to the item and $N$ is the total number of responses.

Items with facility indexes above 0.63 are too easy, and items with facility indexes below 0.37 are too difficult; such items should be deleted.

Item facility refers to the proportion of correct responses, while item difficulty refers to the proportion of wrong responses.
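The calculation can be sketched in a few lines. The example assumes that IF is the number of correct responses divided by the total number of responses, and it applies the 0.37 and 0.63 cut-offs quoted above. The function names and the counts of correct answers are hypothetical.

```python
def item_facility(correct, total):
    """IF = number of correct responses / total number of responses."""
    return correct / total

def classify_item(facility, low=0.37, high=0.63):
    """Label an item using the 0.37 / 0.63 cut-offs quoted in the text."""
    if facility > high:
        return "too easy"
    if facility < low:
        return "too difficult"
    return "acceptable"

# Hypothetical counts of correct answers out of 32 examinees.
for correct in (32, 16, 2):
    facility = item_facility(correct, 32)
    difficulty = 1 - facility  # proportion of wrong responses
    print(f"{correct}/32: IF={facility:.2f}, "
          f"difficulty={difficulty:.2f}, {classify_item(facility)}")
```

Keeping the cut-offs as parameters makes it easy to try stricter or looser bounds without touching the calculation itself.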

Item Facility (IF)

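| Item | Item Facility (IF) | Item | Item Facility (IF) | Item | Item Facility (IF) |
|---|---|---|---|---|---|
| 1 | 1 | 41 | 1 | 81 | 0.93 |
| 2 | 1 | 42 | 0.81 | 82 | 0.9 |
| 3 | 1 | 43 | 1 | 83 | 1 |
| 4 | 1 | 44 | 1 | 84 | 0.81 |
| 5 | 1 | 45 | 1 | 85 | 1 |
| 6 | 1 | 46 | 0.06 | 86 | 1 |
| 7 | 1 | 47 | 0.06 | 87 | 1 |
| 8 | 1 | 48 | 0.93 | 88 | 0.31 |
| 9 | 1 | 49 | 0.65 | 89 | 0.75 |
| 10 | 1 | 50 | 0.75 | 90 | 1 |
| 11 | 1 | 51 | 1 | 91 | 1 |
| 12 | 1 | 52 | 1 | 92 | 0.90 |
| 13 | 1 | 53 | 1 | 93 | 0.03 |
| 14 | 1 | 54 | 1 | 94 | 0.84 |
| 15 | 1 | 55 | 0.50 | 95 | 0.03 |
| 16 | 1 | 56 | 0.06 | 96 | 0.03 |
| 17 | 1 | 57 | 0.18 | 97 | 1 |
| 18 | 0.90 | 58 | 1 | 98 | 0.62 |
| 19 | 1 | 59 | 1 | 99 | 0.03 |
| 20 | 1 | 60 | 1 | 100 | 0.93 |
| 21 | 1 | 61 | 0.93 | 101 | 1 |
| 22 | 0.81 | 62 | 1 | 102 | 0.71 |
| 23 | 1 | 63 | 1 | 103 | 1 |
| 24 | 0.81 | 64 | 1 | 104 | 0.12 |
| 25 | 1 | 65 | 1 | 105 | 1 |
| 26 | 1 | 66 | 1 | 106 | 0.06 |
| 27 | 1 | 67 | 0.87 | 107 | 0.25 |
| 28 | 0.84 | 68 | 0.09 | 108 | 0.81 |
| 29 | 1 | 69 | 1 | 109 | 0.06 |
| 30 | 1 | 70 | 1 | 110 | 0.28 |
| 31 | 1 | 71 | 1 | 111 | 0.37 |
| 32 | 1 | 72 | 0.40 | 112 | 1 |
| 33 | 1 | 73 | 1 | 113 | 1 |
| 34 | 1 | 74 | 1 | 114 | 1 |
| 35 | 1 | 75 | 1 | 115 | 0.09 |
| 36 | 1 | 76 | 1 | 116 | 0.50 |
| 37 | 1 | 77 | 0.09 | 117 | 1 |
| 38 | 1 | 78 | 1 | 118 | 1 |
| 39 | 1 | 79 | 1 | 119 | 1 |
| 40 | 0.40 | 80 | 1 | 120 | 0.40 |

* Items are numbered consecutively across the four tests (2000 / 3000 / Academic / 5000 word level), 30 items per test.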

* Item discrimination (item differentiation)

Item discrimination refers to how well a test item discriminates between weak (less knowledgeable) and strong (more knowledgeable) examinees in the ability being tested. Item facility and item discrimination are related: an item with a very high or very low facility index is unlikely to have discriminating power.

A suitable procedure for calculating ID is to rank the total scores of the test takers from highest to lowest and then divide the examinees into two equal groups: the higher half (high group, H) and the lower half (low group, L). Finally, apply this formula:

$$\mathrm{ID} = \frac{\mathrm{CH} - \mathrm{CL}}{\frac{1}{2}N}$$

CH: number of correct responses to a particular item by the examinees in the high group
CL: number of correct responses to a particular item by the examinees in the low group
N: total number of examinees

In contrast to item facility, where the ideal index is 0.50, the ideal index for item discrimination is unity (1). Nevertheless, items with a discrimination value above 0.40 can be considered acceptable. An item discriminates in a positive direction (positive discrimination) if more test takers in the upper group than in the lower group get the item right.
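The ranking-and-splitting procedure can be sketched as follows. The sketch assumes the usual form ID = (CH − CL) / (N / 2), where N is the total number of examinees; the function name and the eight-examinee data set are hypothetical.

```python
def item_discrimination(total_scores, item_correct):
    """ID = (CH - CL) / (N / 2) for one item.

    total_scores: total test score of each examinee
    item_correct: 1 if that examinee answered the item correctly, else 0
    """
    # Rank examinees from highest to lowest total score.
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2
    # With an odd number of examinees the middle one is left out.
    high, low = ranked[:half], ranked[-half:]
    ch = sum(correct for _, correct in high)  # correct answers in high group
    cl = sum(correct for _, correct in low)   # correct answers in low group
    return (ch - cl) / half

# Hypothetical data for eight examinees.
totals = [28, 25, 24, 22, 19, 17, 15, 12]
item = [1, 1, 1, 0, 1, 0, 0, 0]
print(item_discrimination(totals, item))  # 0.5
```

Here three of the four high scorers and one of the four low scorers answered correctly, so ID = (3 − 1) / 4 = 0.5. An item that everyone answers correctly yields ID = 0, since both groups contribute the same number of correct responses.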

Item Discrimination (ID)

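| Item | Item Discr. (ID) | Item | Item Discr. (ID) | Item | Item Discr. (ID) |
|---|---|---|---|---|---|
| 1 | 0 | 41 | 0 | 81 | 0.33 |
| 2 | 0 | 42 | 0.37 | 82 | 0.18 |
| 3 | 0 | 43 | 0 | 83 | 0 |
| 4 | 0 | 44 | 0 | 84 | 0.37 |
| 5 | 0 | 45 | 0 | 85 | 0 |
| 6 | 0 | 46 | 0.12 | 86 | 0 |
| 7 | 0 | 47 | 0.12 | 87 | 0 |
| 8 | 0 | 48 | 0.33 | 88 | 0.62 |
| 9 | 0 | 49 | 0.31 | 89 | 0.50 |
| 10 | 0 | 50 | 0.50 | 90 | 0 |
| 11 | 0 | 51 | 0 | 91 | 0 |
| 12 | 0 | 52 | 0 | 92 | 0.18 |
| 13 | 0 | 53 | 0 | 93 | 0.06 |
| 14 | 0 | 54 | 0 | 94 | 0.31 |
| 15 | 0 | 55 | 1 | 95 | 0.06 |
| 16 | 0 | 56 | 0.12 | 96 | 0.06 |
| 17 | 0 | 57 | 0.37 | 97 | 0 |
| 18 | 0.18 | 58 | 0 | 98 | 0.75 |
| 19 | 0 | 59 | 0 | 99 | 0.06 |
| 20 | 0 | 60 | 0 | 100 | 0.12 |
| 21 | 0 | 61 | 0.33 | 101 | 0 |
| 22 | 0.37 | 62 | 0 | 102 | 0.43 |
| 23 | 0 | 63 | 0 | 103 | 0 |
| 24 | 0.37 | 64 | 0 | 104 | 0.25 |
| 25 | 0 | 65 | 0 | 105 | 0 |
| 26 | 0 | 66 | 0 | 106 | 0.12 |
| 27 | 0 | 67 | 0.25 | 107 | 0.50 |
| 28 | 0.31 | 68 | 0.18 | 108 | 0.37 |
| 29 | 0 | 69 | 0 | 109 | 0.12 |
| 30 | 0 | 70 | 0 | 110 | 0.56 |
| 31 | 0 | 71 | 0 | 111 | 0.25 |
| 32 | 0 | 72 | 0.81 | 112 | 0 |
| 33 | 0 | 73 | 0 | 113 | 0 |
| 34 | 0 | 74 | 0 | 114 | 0 |
| 35 | 0 | 75 | 0 | 115 | 0.18 |
| 36 | 0 | 76 | 0 | 116 | 1 |
| 37 | 0 | 77 | 0.18 | 117 | 0 |
| 38 | 0 | 78 | 0 | 118 | 0 |
| 39 | 0 | 79 | 0 | 119 | 0 |
| 40 | 0.81 | 80 | 0 | 120 | 0.81 |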

Reliability

Reliability is a quality of test scores that refers to the consistency of measures across different times, test forms, raters, and other characteristics of the measurement context. Synonyms for reliability are dependability, stability, consistency, predictability, and accuracy. To put it another way, the tendency toward consistency from one set of measurements to the next is called reliability. Reliability is thus best defined as the consistency of the scores produced by a given test.

• KR-21 method: this formula is based on the assumption that all items in a test are designed to measure a single trait. Because it relies on a purely statistical procedure, the method is sometimes called rational equivalence.

$$r_{\text{KR-21}} = \frac{K}{K-1}\left(1 - \frac{X(K - X)}{KV}\right)$$

K: the number of the items in a test
X: the mean score
V: the variance
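A minimal sketch of the KR-21 calculation is given below, assuming the standard form of the formula, (K / (K − 1)) · (1 − X(K − X) / (K · V)). The function name is hypothetical; the example plugs in the figures reported for the 2000 word level in the summary (K = 30, mean = 29.37, variance = 0.82).

```python
def kr21(k, mean, variance):
    """KR-21 reliability: (K / (K - 1)) * (1 - X * (K - X) / (K * V)).

    k: number of items, mean: mean score (X), variance: score variance (V)
    """
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

# Figures reported for the 2000 word level: K = 30, X = 29.37, V = 0.82.
print(round(kr21(30, 29.37, 0.82), 3))
```

With these rounded inputs the result is about 0.26, close to the 0.25 reported for this level; the small gap presumably comes from rounding.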

* Summarized Data:

Number of subjects: 32
Four Vocabulary Levels Tests: 2000 / 3000 / Academic / 5000 word level
Maximum mark in each level test: 30
Total mark out of four tests (maximum): 120

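| Subject | 2000 | 3000 | Academic | 5000 | Total |
|---|---|---|---|---|---|
| 1 | 30 | 26 | 26 | 22 | 104 |
| 2 | 30 | 27 | 25 | 19 | 101 |
| 3 | 30 | 23 | 25 | 19 | 97 |
| 4 | 28 | 26 | 29 | 22 | 105 |
| 5 | 28 | 25 | 27 | 21 | 101 |
| 6 | 30 | 25 | 27 | 13 | 95 |
| 7 | 30 | 26 | 27 | 18 | 101 |
| 8 | 30 | 30 | 29 | 24 | 113 |
| 9 | 28 | 28 | 27 | 20 | 103 |
| 10 | 30 | 22 | 22 | 18 | 92 |
| 11 | 30 | 28 | 27 | 24 | 109 |
| 12 | 30 | 24 | 24 | 13 | 91 |
| 13 | 30 | 28 | 30 | 25 | 113 |
| 14 | 29 | 28 | 28 | 20 | 105 |
| 15 | 28 | 23 | 27 | 17 | 95 |
| 16 | 29 | 22 | 27 | 16 | 94 |
| 17 | 29 | 25 | 29 | 19 | 102 |
| 18 | 30 | 27 | 29 | 23 | 109 |
| 19 | 29 | 28 | 26 | 17 | 100 |
| 20 | 30 | 29 | 29 | 28 | 116 |
| 21 | 27 | 26 | 27 | 22 | 102 |
| 22 | 30 | 26 | 29 | 19 | 104 |
| 23 | 30 | 30 | 29 | 24 | 113 |
| 24 | 30 | 28 | 28 | 15 | 101 |
| 25 | 30 | 27 | 24 | 14 | 95 |
| 26 | 28 | 24 | 30 | 18 | 100 |
| 27 | 30 | 23 | 22 | 17 | 92 |
| 28 | 30 | 28 | 27 | 20 | 105 |
| 29 | 29 | 23 | 27 | 15 | 94 |
| 30 | 28 | 29 | 28 | 26 | 111 |
| 31 | 30 | 29 | 27 | 30 | 116 |
| 32 | 30 | 29 | 30 | 24 | 113 |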
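As an illustration of how the summary statistics can be recomputed from the raw scores, the sketch below takes the 2000 word level column of the table above and computes its mean and variance with Python's standard `statistics` module. Using the sample variance (division by n − 1) is an assumption; it is the form that agrees with the variance of 0.82 reported for this level.

```python
from statistics import mean, variance

# Scores of the 32 subjects on the 2000 word level, copied from the table above.
scores_2000 = [30, 30, 30, 28, 28, 30, 30, 30, 28, 30, 30, 30, 30, 29, 28, 29,
               29, 30, 29, 30, 27, 30, 30, 30, 30, 28, 30, 30, 29, 28, 30, 30]

m = mean(scores_2000)      # arithmetic mean
v = variance(scores_2000)  # sample variance (divides by n - 1)
print(f"Mean = {m:.3f}, Variance = {v:.3f}")  # Mean = 29.375, Variance = 0.823
```

The same two calls can be applied to the other three columns to check the remaining levels.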

Summarized Statistical Calculations

Mean – Variance – Reliability

* The 2000 word level

Mean = 29.37
Variance = 0.82
K = 30

* The 3000 word level

Mean = 26.31
Variance = 5.64
K = 30

* Academic vocabulary

Mean = 27.12
Variance = 4.37
K = 30

* The 5000 word level

Mean = 20.06
Variance = 17.86
K = 30

| | 2000 | 3000 | Academic | 5000 |
|---|---|---|---|---|
| Calculated Reliability | 0.25 | 0.43 | 0.42 | 0.64 |

