A Guide to Objective Scrutiny of a Test: Individual Item Analysis & Overall Test Analysis in Practice
By Mohammad Mohseni Far, M.A.,
Department of English,
Faculty of Letters & Humanities,
Shahid Chamran University, Ahvaz, Iran
Email: Mmb_m2005[at]yahoo.com
See also: How to Test a Test: Preliminary Procedures & Objective Scrutiny of a Test
Introduction
The present study examines important characteristics of the
individual items of a test, i.e. item facility, item
discrimination and choice distribution, as well as a
significant feature of the overall test, i.e. reliability.
To this end, four valid vocabulary tests, each comprising
30 items (120 items in total), were administered to an adequate
number of subjects majoring in English Translation
(32 students, Shahid Chamran University, B.A. level), a group as
similar as possible to those for whom the tests are really intended
(i.e. the target population).
The whole 120-item package is divided into four Vocabulary
Levels Tests: the 2000 word level, the 3000 word level,
Academic vocabulary and the 5000 word level, in that order.
To make it easier to keep track of (and analyze)
the overall performance of the students, the items are numbered
consecutively in the aforementioned order. The
required characteristics of each item are calculated. In addition,
the value of reliability is computed for each of the four
tests.
This try-out activity, known as pretesting, involves
administering the test in order to collect information
about the usefulness of the test itself and to improve
the test and the testing procedures. Specifically,
the goal of pretesting is twofold. The first purpose
is to determine objectively the characteristics of the individual
items (item analysis). These characteristics include
item facility (IF), item discrimination
(ID) and choice distribution (CD). The second purpose
is to determine the reliability of each test as a whole.
Having calculated and analyzed
the pertinent statistical values, one can either safely
eliminate the ill-constructed, malfunctioning and unsuitable
items or revise them technically so as to develop
an appropriate test that fulfils academic requirements.
* Choice distribution (CD) [response frequency distribution]
Choice distribution is a technique
which helps a test developer to see how each of the
distractors performs in a given test administration. In simple
words, choice distribution refers to the frequency with
which the alternatives are selected by the examinees, i.e. the distribution
of responses given to the different alternatives in a multiple-choice
item.
Choice distribution should be determined
in order to improve the test both quantitatively and qualitatively.
Through choice distribution, the test developer
can observe deficiencies in the choices
and then discard or modify them. For example, if a choice
is not selected by any examinee, this distractor
does not function satisfactorily; therefore it should be
modified or deleted.
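The counting described above can be sketched in a few lines of Python. This is only an illustration: the responses, the option labels and the function names are hypothetical and are not taken from the study's data.

```python
from collections import Counter

# Hypothetical responses of eight examinees to one four-option item
# (illustrative only; not data from the study). The keyed answer is "b".
OPTIONS = ["a", "b", "c", "d"]
responses = ["b", "b", "a", "b", "c", "b", "a", "b"]
key = "b"


def choice_distribution(responses, options):
    """Return how often each option was selected, including zero counts."""
    counts = Counter(responses)
    return {option: counts.get(option, 0) for option in options}


def unused_distractors(distribution, key):
    """Return the wrong options that no examinee selected."""
    return [opt for opt, n in distribution.items() if opt != key and n == 0]


distribution = choice_distribution(responses, OPTIONS)
print(distribution)                           # {'a': 2, 'b': 5, 'c': 1, 'd': 0}
print(unused_distractors(distribution, key))  # ['d']
```

In this made-up item, option "d" attracts no one, so it would be a candidate for revision or deletion.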
Analysis:
- Non-highlighted rows
These are the items answered correctly by all
testees. Technically speaking, the distractors in these
items have not functioned well, so they need to be modified
or eliminated.
- Highlighted rows
(boxes in white)
These contain boxes showing how often each choice was selected
by the test takers. Although some of the distractors
have worked well (i.e. those wrong choices
which were selected by some examinees), there still
seem to be some malfunctioning distractors that need to be
modified or deleted (i.e. those which were not
selected by anyone).
[Table: Choice Distribution, the 2000 word level]
[Table: Choice Distribution, the 3000 word level]
[Table: Choice Distribution, Academic vocabulary]
[Table: Choice Distribution, the 5000 word level]
* Item facility (IF) [facility value, item easiness, item difficulty]
Item facility is a measure of the ease of a
test item. It has to do with how easy or difficult
an item is from the viewpoint of the group of students or
examinees taking the test of which that item is a part.
The reason for concern with IF is very simple: a test item
that is too easy (say, an item that every student answers
correctly) or too difficult (say, one
that every student answers incorrectly) can tell us nothing
about the differences in ability within the test population,
so it should be deleted.
The formula for producing a decimal
value for IF is:

$$IF = \frac{\sum C}{N}$$

ΣC: the number of correct responses to the item
N: the total number of responses (examinees)

Items with facility indexes above 0.63 are
too easy, and items with facility indexes below 0.37 are
too difficult; such items should be deleted.
Item facility refers to the proportion of
correct responses, while item difficulty refers to the proportion
of wrong responses.
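As a rough sketch, item facility can be computed as the proportion of correct responses and then checked against the 0.37 and 0.63 cut-offs given above. The function names and the correct-response counts below are hypothetical; only the cut-offs and the group size of 32 come from the text.

```python
def item_facility(num_correct, num_examinees):
    """Proportion of examinees who answered the item correctly."""
    if num_examinees <= 0:
        raise ValueError("num_examinees must be positive")
    return num_correct / num_examinees


def classify(if_value, low=0.37, high=0.63):
    """Label an item using the cut-offs quoted in the text."""
    if if_value > high:
        return "too easy"
    if if_value < low:
        return "too difficult"
    return "acceptable"


# Hypothetical counts of correct answers out of 32 examinees.
for count in (32, 16, 2):
    value = item_facility(count, 32)
    print(count, round(value, 2), classify(value))
```

An item answered correctly by all 32 examinees gets IF = 1 and is flagged as too easy, while one answered correctly by half of them gets the ideal value of 0.50.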

Item Facility (IF)
| Item | Item Facility (IF) | Item | Item Facility (IF) | Item | Item Facility (IF) |
| 1 | 1 | 41 | 1 | 81 | 0.93 |
| 2 | 1 | 42 | 0.81 | 82 | 0.9 |
| 3 | 1 | 43 | 1 | 83 | 1 |
| 4 | 1 | 44 | 1 | 84 | 0.81 |
| 5 | 1 | 45 | 1 | 85 | 1 |
| 6 | 1 | 46 | 0.06 | 86 | 1 |
| 7 | 1 | 47 | 0.06 | 87 | 1 |
| 8 | 1 | 48 | 0.93 | 88 | 0.31 |
| 9 | 1 | 49 | 0.65 | 89 | 0.75 |
| 10 | 1 | 50 | 0.75 | 90 | 1 |
| 11 | 1 | 51 | 1 | 91 | 1 |
| 12 | 1 | 52 | 1 | 92 | 0.90 |
| 13 | 1 | 53 | 1 | 93 | 0.03 |
| 14 | 1 | 54 | 1 | 94 | 0.84 |
| 15 | 1 | 55 | 0.50 | 95 | 0.03 |
| 16 | 1 | 56 | 0.06 | 96 | 0.03 |
| 17 | 1 | 57 | 0.18 | 97 | 1 |
| 18 | 0.90 | 58 | 1 | 98 | 0.62 |
| 19 | 1 | 59 | 1 | 99 | 0.03 |
| 20 | 1 | 60 | 1 | 100 | 0.93 |
| 21 | 1 | 61 | 0.93 | 101 | 1 |
| 22 | 0.81 | 62 | 1 | 102 | 0.71 |
| 23 | 1 | 63 | 1 | 103 | 1 |
| 24 | 0.81 | 64 | 1 | 104 | 0.12 |
| 25 | 1 | 65 | 1 | 105 | 1 |
| 26 | 1 | 66 | 1 | 106 | 0.06 |
| 27 | 1 | 67 | 0.87 | 107 | 0.25 |
| 28 | 0.84 | 68 | 0.09 | 108 | 0.81 |
| 29 | 1 | 69 | 1 | 109 | 0.06 |
| 30 | 1 | 70 | 1 | 110 | 0.28 |
| 31 | 1 | 71 | 1 | 111 | 0.37 |
| 32 | 1 | 72 | 0.40 | 112 | 1 |
| 33 | 1 | 73 | 1 | 113 | 1 |
| 34 | 1 | 74 | 1 | 114 | 1 |
| 35 | 1 | 75 | 1 | 115 | 0.09 |
| 36 | 1 | 76 | 1 | 116 | 0.50 |
| 37 | 1 | 77 | 0.09 | 117 | 1 |
| 38 | 1 | 78 | 1 | 118 | 1 |
| 39 | 1 | 79 | 1 | 119 | 1 |
| 40 | 0.40 | 80 | 1 | 120 | 0.40 |
* Items are numbered consecutively across the four tests, in the order
2000 / 3000 / Academic / 5000 word level (30 items per test)
* Item discrimination (item differentiation)
Item discrimination refers to how well a
test item discriminates between weak (less knowledgeable)
and strong (more knowledgeable) examinees in the ability
being tested. There is a relationship between item facility
and item discrimination: an item with a very high or very low
facility index is unlikely to have much discriminating power,
because nearly all examinees in both groups respond to it in the same way.
A suitable procedure for calculating ID
is to rank the total scores of the test takers from the highest
to the lowest and then divide the examinees into two equal groups,
the higher half (high group, H) and the lower half (low group, L).
Finally, apply this formula:

$$ID = \frac{CH - CL}{\frac{1}{2}N}$$

CH: number of correct responses
to a particular item by the examinees in the high group
CL: number of correct responses to a particular
item by the examinees in the low group
N: total number of examinees (so that ½N is the size of each group)

In contrast to item facility, where the ideal
index is 0.50, for item discrimination the ideal index is
unity (1). Nevertheless, items which show a discrimination
value above 0.40 can be considered acceptable. An item
discriminates in a positive direction (positive discrimination)
if more test takers in the upper group than in the lower group
get the item right.
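The ranking-and-splitting procedure can be sketched as follows. The sketch assumes the common form of the index, the difference between the two groups' correct counts divided by the size of one group; the function name and the eight examinees are hypothetical and are not the study's data.

```python
def item_discrimination(scored):
    """Compute ID for one item.

    scored: list of (total_score, item_correct) pairs, one per examinee.
    With an odd number of examinees the middle one is left out so that
    the two groups stay equal in size (an assumption of this sketch).
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2
    if half == 0:
        raise ValueError("at least two examinees are needed")
    high, low = ranked[:half], ranked[-half:]
    ch = sum(1 for _, correct in high if correct)  # correct in high group
    cl = sum(1 for _, correct in low if correct)   # correct in low group
    return (ch - cl) / half


# Hypothetical data: (total score, answered this item correctly?)
examinees = [
    (28, True), (27, True), (25, True), (24, False),
    (20, True), (18, False), (15, False), (12, False),
]
print(item_discrimination(examinees))  # 0.5
```

Here three of the four high scorers and one of the four low scorers answer correctly, so ID = (3 - 1) / 4 = 0.5, which is above the 0.40 level of acceptability mentioned above.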
Item Discrimination (ID)
| Item | Item Discr. (ID) | Item | Item Discr. (ID) | Item | Item Discr. (ID) |
| 1 | 0 | 41 | 0 | 81 | 0.33 |
| 2 | 0 | 42 | 0.37 | 82 | 0.18 |
| 3 | 0 | 43 | 0 | 83 | 0 |
| 4 | 0 | 44 | 0 | 84 | 0.37 |
| 5 | 0 | 45 | 0 | 85 | 0 |
| 6 | 0 | 46 | 0.12 | 86 | 0 |
| 7 | 0 | 47 | 0.12 | 87 | 0 |
| 8 | 0 | 48 | 0.33 | 88 | 0.62 |
| 9 | 0 | 49 | 0.31 | 89 | 0.50 |
| 10 | 0 | 50 | 0.50 | 90 | 0 |
| 11 | 0 | 51 | 0 | 91 | 0 |
| 12 | 0 | 52 | 0 | 92 | 0.18 |
| 13 | 0 | 53 | 0 | 93 | 0.06 |
| 14 | 0 | 54 | 0 | 94 | 0.31 |
| 15 | 0 | 55 | 1 | 95 | 0.06 |
| 16 | 0 | 56 | 0.12 | 96 | 0.06 |
| 17 | 0 | 57 | 0.37 | 97 | 0 |
| 18 | 0.18 | 58 | 0 | 98 | 0.75 |
| 19 | 0 | 59 | 0 | 99 | 0.06 |
| 20 | 0 | 60 | 0 | 100 | 0.12 |
| 21 | 0 | 61 | 0.33 | 101 | 0 |
| 22 | 0.37 | 62 | 0 | 102 | 0.43 |
| 23 | 0 | 63 | 0 | 103 | 0 |
| 24 | 0.37 | 64 | 0 | 104 | 0.25 |
| 25 | 0 | 65 | 0 | 105 | 0 |
| 26 | 0 | 66 | 0 | 106 | 0.12 |
| 27 | 0 | 67 | 0.25 | 107 | 0.50 |
| 28 | 0.31 | 68 | 0.18 | 108 | 0.37 |
| 29 | 0 | 69 | 0 | 109 | 0.12 |
| 30 | 0 | 70 | 0 | 110 | 0.56 |
| 31 | 0 | 71 | 0 | 111 | 0.25 |
| 32 | 0 | 72 | 0.81 | 112 | 0 |
| 33 | 0 | 73 | 0 | 113 | 0 |
| 34 | 0 | 74 | 0 | 114 | 0 |
| 35 | 0 | 75 | 0 | 115 | 0.18 |
| 36 | 0 | 76 | 0 | 116 | 1 |
| 37 | 0 | 77 | 0.18 | 117 | 0 |
| 38 | 0 | 78 | 0 | 118 | 0 |
| 39 | 0 | 79 | 0 | 119 | 0 |
| 40 | 0.81 | 80 | 0 | 120 | 0.81 |
Reliability
Reliability is a quality of test scores which refers to
the consistency of measures across different times, test
forms, raters and other characteristics of the measurement
context. Synonyms for reliability are dependability,
stability, consistency, predictability and accuracy. To
put it another way, the tendency toward consistency from one
set of measurements to the next is called reliability.
Accordingly, reliability is best defined as the consistency
of scores produced by a given test.
KR-21 method:
this formula is based on the assumption that all items in
a test are designed to measure a single trait. Because it applies
a purely statistical procedure, the method is sometimes
called rational equivalence.

$$r = \frac{K}{K-1}\left(1 - \frac{X(K - X)}{KV}\right)$$

K: the number of items in the test
X: the mean score
V: the variance
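A short sketch of the KR-21 computation, assuming the standard form of the formula, r = K/(K-1) × (1 - X(K-X)/(KV)), with K, X and V as defined above. The function name is illustrative; the input values are the mean, variance and K reported in this article for the Academic vocabulary test.

```python
def kr21(k, mean, variance):
    """KR-21 reliability from the number of items, mean and variance."""
    if k < 2 or variance <= 0:
        raise ValueError("need k >= 2 and a positive variance")
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))


# Academic vocabulary test as summarized in the article:
# K = 30, mean = 27.12, variance = 4.37
print(round(kr21(30, 27.12, 4.37), 2))  # 0.42
```

For the other three tests the same computation on the rounded summary figures gives values within about 0.01 of those reported, a gap that presumably reflects rounding or truncation in the original calculation.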
* Summarized Data:
Number of subjects: 32
Four Vocabulary Levels Tests: 2000 / 3000 / Academic / 5000 word level
Maximum mark in each level test: 30
Total mark out of four tests (maximum): 120
| Subject | 2000 | 3000 | Academic | 5000 | Total |
| 1 | 30 | 26 | 26 | 22 | 104 |
| 2 | 30 | 27 | 25 | 19 | 101 |
| 3 | 30 | 23 | 25 | 19 | 97 |
| 4 | 28 | 26 | 29 | 22 | 105 |
| 5 | 28 | 25 | 27 | 21 | 101 |
| 6 | 30 | 25 | 27 | 13 | 95 |
| 7 | 30 | 26 | 27 | 18 | 101 |
| 8 | 30 | 30 | 29 | 24 | 113 |
| 9 | 28 | 28 | 27 | 20 | 103 |
| 10 | 30 | 22 | 22 | 18 | 92 |
| 11 | 30 | 28 | 27 | 24 | 109 |
| 12 | 30 | 24 | 24 | 13 | 91 |
| 13 | 30 | 28 | 30 | 25 | 113 |
| 14 | 29 | 28 | 28 | 20 | 105 |
| 15 | 28 | 23 | 27 | 17 | 95 |
| 16 | 29 | 22 | 27 | 16 | 94 |
| 17 | 29 | 25 | 29 | 19 | 102 |
| 18 | 30 | 27 | 29 | 23 | 109 |
| 19 | 29 | 28 | 26 | 17 | 100 |
| 20 | 30 | 29 | 29 | 28 | 116 |
| 21 | 27 | 26 | 27 | 22 | 102 |
| 22 | 30 | 26 | 29 | 19 | 104 |
| 23 | 30 | 30 | 29 | 24 | 113 |
| 24 | 30 | 28 | 28 | 15 | 101 |
| 25 | 30 | 27 | 24 | 14 | 95 |
| 26 | 28 | 24 | 30 | 18 | 100 |
| 27 | 30 | 23 | 22 | 17 | 92 |
| 28 | 30 | 28 | 27 | 20 | 105 |
| 29 | 29 | 23 | 27 | 15 | 94 |
| 30 | 28 | 29 | 28 | 26 | 111 |
| 31 | 30 | 29 | 27 | 30 | 116 |
| 32 | 30 | 29 | 30 | 24 | 113 |
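As a check on how the summary statistics follow from the table, the sketch below recomputes the mean and variance of the 2000 word level column. It assumes that the variance is the sample variance (divisor n - 1); the article does not say which divisor was used, but this choice reproduces the reported 0.82, whereas the population variance would round to 0.80.

```python
from statistics import mean, variance

# 2000 word level scores of subjects 1 to 32, copied from the table above.
scores_2000 = [
    30, 30, 30, 28, 28, 30, 30, 30, 28, 30, 30, 30, 30, 29, 28, 29,
    29, 30, 29, 30, 27, 30, 30, 30, 30, 28, 30, 30, 29, 28, 30, 30,
]

m = mean(scores_2000)      # arithmetic mean of the 32 scores
v = variance(scores_2000)  # sample variance, divisor n - 1 (assumed)

print(m)            # 29.375
print(round(v, 2))  # 0.82
```

The same two calls can be applied to the other three columns to obtain the inputs for the reliability formula.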
Summarized Statistical Calculations
Mean – Variance – Reliability
* The 2000 word level
Mean = 29.37
Variance = 0.82
K = 30

* The 3000 word level
Mean = 26.31
Variance = 5.64
K = 30

* Academic vocabulary
Mean = 27.12
Variance = 4.37
K = 30

* The 5000 word level
Mean = 20.06
Variance = 17.86
K = 30

| | 2000 | 3000 | Academic | 5000 |
| Calculated Reliability | 0.25 | 0.43 | 0.42 | 0.64 |