Chapter 14

CW14.1 How to write bad multiple-choice questions

Example 1

Here is a passage about the ‘Sad Story’ you met in Box 7.18. The multiple-choice question below it is a bad one. Why?

Naiman et al. (1978) mention one girl in Canada who had many qualities that suggested she should be a good foreign language learner. Her teachers described her as ‘attentive, very jovial and relaxed, obviously enjoying some of the activities’; French was, she said, her favourite subject, and outside the classroom she would listen to French radio and talk to her friends in French, just for fun. She was not afraid of making mistakes, was very motivated and worked hard. Everything was there to make her a good learner, except that she lacked something – talent (aptitude), perhaps. As a result, she was second-lowest in her class.

This passage is about:

  1. a girl studying French in Canada
  2. a highly motivated language learner
  3. a person possibly lacking in language aptitude
  4. a poor language learner

Example 2

This is taken from Heaton (1988: 32). It is also used by Alderson et al. (1995: 49).

Select the option closest in meaning to the word in bold:

He began to choke while he was eating the fish.

  1. die
  2. cough and vomit
  3. be unable to breathe because of something in his windpipe
  4. grow very angry

What is wrong with this question? Alderson et al. give three reasons, which you will find at the end of this page.

Example 2 can be used to illustrate three pieces of technical terminology. The sentence itself (He began to choke …) is known as the stem. Then there is the correct option (in this case probably intended to be option 3) and the distractors (options 1, 2 and 4).

Try writing some good multiple-choice questions for the ‘Sad Story’ passage above.

Alderson et al.’s reasons

The multiple-choice question in Example 1 is bad because all the alternatives given are right! You need to be sure that although alternatives may contain elements of truth, there is one – and only one – right answer.

Alderson et al. point out first that option 3 in Example 2 is much longer than the other options and looks like a dictionary definition. A candidate who wanted to guess might well choose it. Second, option 2 does give a passable definition of choke, and a learner could be forgiven for choosing it. Third, we do not in fact know that the man did not choke through anger. Though it is likely that a fish bone caused the problem, something we do not know about might have made him angry over his dinner.

CW14.2 A C-Test

In this example (from a passage in section 1.5), the second half of every second word is deleted: exactly half the letters are omitted when the word has an even number of letters, and half plus one when it has an odd number. This follows the example given in Alderson et al. (1995: 56).

In Classroom 2, the teenage English pupils are learning Italian. The book they are using is full of extremely complicated grammar rules, explained in English. The tea___ spends so___ twenty min___ of t___ lesson expla___ the gra___ point o___ the d___, in Eng___, using diag___ on t___ blackboard, a___ plenty o___ grammatical te___ (talking o___ ‘tenses’, ‘no___’, ‘adv___’ and t___ like).
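The deletion rule lends itself to being automated. Here is a minimal Python sketch (our own illustration, not part of Alderson et al.’s procedure) which applies the half / half-plus-one rule to every second word of a passage. A full C-Test would normally leave the opening sentence or two intact, which this simple sketch does not attempt to do.

import re

def c_test(text):
    """Blank the second half of every second word (half plus one for odd-length words)."""
    words = text.split()
    out = []
    for i, word in enumerate(words, start=1):
        letters = re.sub(r"[^\w']", "", word)   # ignore punctuation when counting letters
        if i % 2 == 0 and len(letters) > 1:
            keep = len(letters) // 2             # even length: half kept; odd length: half plus one deleted
            out.append(word.replace(letters, letters[:keep] + "___", 1))
        else:
            out.append(word)
    return " ".join(out)

print(c_test("The teacher spends some twenty minutes of the lesson explaining the grammar point."))
# prints: The tea___ spends so___ twenty min___ of t___ lesson expla___ the gra___ point.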

If you’d like to try your hand at constructing a C-Test, do so using another passage from this book. Try your test out on a friend.

CW14.3 Indirect and direct tests

A distinction is made in testing circles between indirect and direct tests. It is the same distinction that was discussed in sections 12.4.1 and 12.5.1, when we spoke about indirect and direct relationships to the ‘terminal behaviour’ in scales and real-thing practice. Indirect tests measure abilities thought to be important to the terminal behaviour, but they do not attempt to simulate the behaviour itself. One of Hughes’s (1989) examples of indirect testing is Lado’s (1961) method of testing pronunciation by a paper-and-pencil test in which pairs of words that rhyme have to be identified. Aspiring poets apart, people do not usually spend much time identifying rhyming words as part of their everyday routine. Direct tests try as far as possible to replicate the actual terminal behaviour in the test itself.

Baker (1989: 13) gives an example of what might be called ‘direct testing’ from the Oxford Syndicate’s Preliminary Test in English. It involves using an English–English dictionary to help solve reading problems. In the relevant part of the test, learners are given two dictionary entries, for the words get and give. These are quite lengthy and cover many uses of these words. As is usual in dictionaries, each use is numbered, with a definition and an example given. So use (1) of get is ‘have something’ (as in Peter’s got a huge house), while use (2) is ‘buy or procure’ (as in I’ll get some wine from the supermarket on the way home). Learners are given ten sentences, each containing one of the two verbs. They are asked to write beside each sentence the number corresponding to the use as it appears in the dictionary. So, to continue the example above, a sentence like Don’t forget to get some sugar at the shop would have the number (2) written beside it.

This part of the test is ‘direct’ because using a dictionary in this way is an activity that many learners will undertake regularly.

CW14.4 Norms and criteria

A norm-referenced test is one in which the learner’s score is measured against others who have taken the same test. The learner may be given a ‘position’ or ‘ordering’ in relation to the others; they might have come third (or tenth) in the class, for example, or be in the top (or bottom) ten per cent. Many of the big, internationally recognized tests operate in a norm-referenced way. Even though an actual ordering of candidates in relation to each other may not be stated, the outside world develops an understanding of how to interpret test scores. So, for example, a score of 40% on a given test will over time come to be meaningful to the world outside.
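To make the idea of a ‘position’ concrete, here is a small illustrative Python sketch (the names and class scores are invented) which works out a candidate’s rank and approximate percentile from a set of raw scores – exactly the kind of information a norm-referenced interpretation rests on.

# Illustrative only: computing a candidate's rank and percentile within a class.
scores = {"Ana": 72, "Ben": 55, "Chen": 81, "Dipa": 64, "Eli": 49}

def position(name):
    """Return (rank, percentile) for one candidate relative to the rest of the class."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    rank = ranked.index(name) + 1                        # 1 = top of the class
    at_or_below = sum(1 for s in scores.values() if s <= scores[name])
    percentile = 100 * at_or_below / len(scores)
    return rank, percentile

rank, pct = position("Dipa")
print(f"Dipa is ranked {rank} of {len(scores)}, at roughly the {pct:.0f}th percentile")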

Norm-referenced tests do not make any direct statement about what a learner will be able to ‘do’ in terms of language or its uses. Tests which do this are criterion-referenced. They are the ones that are usually based on some form of needs analysis and a resulting syllabus specification. The results of the test are expressed in terms of whether the learner has mastered the content specification. So a pass will mean that the learner can do X, Y and Z – and a fail will mean that they cannot.

CW14.5 Empirical and consequential validity

Empirical validity (or criterion-related validity, as some call it) deals with how the test relates to other testing measures. A test should not yield results that are dramatically at odds with the results of other forms of assessment. These ‘other forms’ include teacher perceptions of learner achievement, and though teachers will be prepared for the odd surprise when their learners’ test results are known, they will rightly be suspicious if these results conflict dramatically at every turn with their own judgements. It will be important for a new test to establish its empirical validity early on. One way of doing this will be to use it together with another well-established test to see how the results compare. A test may also be said to have predictive validity where its results yield some information about the future. It may be particularly useful to find out whether placement tests, for example, have any predictive validity. Once a course has been run, you can go back and find out whether the placement test did in fact (in the light of subsequent learner performance) lead to the ‘right’ placement decisions being made.
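One simple way of seeing how the results compare is to correlate candidates’ scores on the new test with their scores on the established one. The sketch below (Python, with invented scores purely for illustration) computes a Pearson correlation coefficient; a value close to 1 would suggest that the two tests rank candidates in much the same way.

from math import sqrt

# Invented scores for the same ten candidates on a new test and an
# established test (illustrative only, not real data).
new_test    = [55, 62, 47, 70, 58, 65, 40, 75, 52, 68]
established = [58, 60, 50, 72, 55, 63, 45, 78, 49, 70]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(f"r = {pearson(new_test, established):.2f}")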

Another form of validity which has attracted recent interest is consequential validity, a notion developed particularly by Messick (1989). It is to do with the pedagogical or social consequences a test might have. Some of these may be intentional, others unintentional; some will be positive, others negative. The notion of washback was discussed in section 14.1, where it was suggested that a test may have the unfortunate consequence of controlling the teaching to an undesirable extent (as teachers ‘teach to the test’ – excluding all content which is not relevant to it). Other more serious consequences might be that a test score is used to assess someone’s suitability for a job, which might go quite beyond the intentions of the test developers. Of course, bad consequences of a test may not be the fault of the test designers at all but of the misguided practitioners who use it. Nevertheless, testers should, it can be argued, at least be aware of the effects their tests are having.

CW14.6 How to be reliable

These suggestions on test reliability are adapted from Hughes (1989):

  • Take enough samples of behaviour.
  • Do not allow candidates too much freedom.
  • Write unambiguous items.
  • Provide clear and explicit instructions.
  • Ensure that tests are well laid out and perfectly legible.
  • Ensure that candidates are familiar with the test format and testing techniques.
  • Provide uniform and non-distracting conditions of administration.
  • Use items that permit scoring which is as objective as possible.
  • Make comparisons between candidates as direct as possible.
  • Provide a detailed scoring key.
  • Train scorers.
  • Agree acceptable responses and appropriate scores at the outset of scoring.
  • Identify candidates by number, not name.
  • Employ multiple, independent scoring.

CW14.7 A battery for academic French

Alderson et al. (1995: 14) invent an example of a test specification for part of a test of French for postgraduate studies. Their specification appears under eight headings. Here are six of them:

General statement of purpose

This describes the test battery (collection of tests) in general terms, saying who it is for and what its aims are.

The test battery

The tests making up the battery are described, with details given such as the length of each test and how it will be administered and scored.

From this point onward, the specification focuses on the reading test alone:

Test focus

The level of reading at which the test will operate is stated here, together with the skills to be focused on. These include distinguishing fact from opinion and understanding the communicative function of sentences and paragraphs.

Source of texts

Much detail is given concerning the nature of the texts to be used. This includes details of the content areas they should cover, statements about whether they should be completely authentic (or modified a little for testing purposes) and their length.

Item types

The number of items to be included in the test is stated, together with guidelines about what type of items they should be. A list of possible item types is given. This includes identifying appropriate headings; labelling or completing diagrams, tables, charts and the like; and gap-filling.

Rubrics

This gives guidelines about what instructions should be given to the learner, including the language level of these instructions. (It would, of course, be silly to have instructions for doing the test that were linguistically more complex than the level being tested.)

CW14.8 A choice between evils

Imagine an oral test being given to a group of students on a visit to Britain. Below are three (invented) answers to one of the tester’s questions:

Tester:

How long have you been in Britain?

Student A:

Another three months.

Student B:

I been here since two week.

Student C:

It very cold and rain.

Describe each answer, saying how it is right and wrong. How would you characterize the difference between the three answers? And the Big Questions: how would you mark these responses? Which answer is ‘least bad’, and which is the worst? Once you have decided this, try to write general instructions to markers to ensure that in this particular case the least bad answer gets a better mark than the worst.

Do all this before you read the following discussion.

Student A’s answer is perfectly correct grammatically. They seem to have misunderstood the question, thinking they have been asked How long will you be in Britain?

Student B has understood the question, but their answer contains three mistakes. The first concerns the tense. Perhaps they think that the tense needed is the simple past (which is wrong), and have formed it by saying been instead of was. Or perhaps they have simply misformed the present perfect tense (the right one to use), leaving out the auxiliary have. The tense in sentences like this presents many students with problems, partly because in some languages the present tense would be used – *I am here since … Second, the student has misused the word since. The words for and since are confusing for many students, because their own languages do not mark the distinction these words signal. Since is used with a ‘point in time’ (since Monday, since 1970), while for is used with a ‘period of time’ (for a week, for three years). So for should be used here. Finally, Student B commits the common error of leaving the final –s off the plural weeks.

As for Student C … They have first of all misunderstood the question completely, perhaps thinking that since the British have a reputation for talking about the weather, any question they are asked will be on that topic. They also omit the verb be – a common mistake. Their final sin is to assume that rain is an adjective as well as a noun. Rainy would be the appropriate word to use here.

Which answer is ‘least bad’, and which is worst? If an important criterion for you is whether messages have been properly conveyed, then Student B is the only one who understands what they have been asked. According to this criterion, their answer would be the ‘least bad’. Student A at least knows that they are being asked about a period of time. Worst from the comprehension point of view is Student C, who does not even understand the general topic area of the question. From the point of view of grammar, Student A does not make any mistakes. Both Students B and C make more than one mistake, with B perhaps producing the grammatically least accurate sentence.

But is grammar more important than understanding messages, or vice versa? There is no simple answer. Many would argue that the latter should be given more attention, in which case you would need to instruct your markers accordingly. Notice that in this case ‘misunderstanding the message’ is to do with listening, not speaking. If you want your test to deal specifically with speaking, then you have to decide to what extent you should consider the ability to understand.