ABSTRACT

RELIABILITY OF A PERFORMANCE TEST FOR THE METAL TRADES

By

Alan L. Greenthal

A study was carried out to examine the reliability of a

performance test constructed for the metal trades. End products were

evaluated by non-tradesmen, using precision instruments for tolerance

measurements and a system of benchmarks for quality of finish of the

end products. Interjudge agreement was high. The five tasks on the

test yielded high internal consistency reliability. Retesting the

subjects after a period of 7 to 20 weeks did not show stability of

scores. This lack of test-retest reliability was not unanticipated.

The interim period was too long to assess test-retest reliability.

The optimal time period would have been one to two weeks, a period

of time which would not have allowed for the differential training and

practice that the testees went through in the longer time period. The

use of performance tests by government and industry in occupational

licensing and in job selection is discussed.

RELIABILITY OF A PERFORMANCE TEST FOR THE METAL TRADES

By

Alan L. Greenthal

A THESIS

Submitted to

Michigan State University

in partial fulfillment of the requirements

for the degree of

MASTER OF ARTS

Department of Psychology

1975

ACKNOWLEDGMENTS

This study was made possible through a grant funded by the

U.S. Department of Labor. Dr. Frank L. Schmidt was the principal

investigator. Mr. Milt Murto, Mr. William Main, Mr. Duane Ebbert,

Mr. William Neggly, and Mr. Walter Okrongley were the representatives

of the participating companies involved with the project. Mr. Ralph

Vanderslice was the master machinist whose assistance was invaluable

to the project. Special thanks go to Dr. John Hunter for his help

with the data analysis. Dr. John Berner, Ms. Felicia Williams,

Ms. Susan Badertscher, Ms. Anna Toth, and Ms. Barbara Ralsky were

the evaluators of the end products. The help and guidance of

Dr. Frank L. Schmidt, as chairman of my master's thesis committee,

was invaluable in carrying out this study. I would also like to

thank Dr. Neal Schmitt and Dr. John Wakeley for serving on my

committee.

TABLE OF CONTENTS

Page

LIST OF TABLES .......................... iv

LIST OF APPENDICES ........................ vi

INTRODUCTION........................... 1

PROCEDURE ............................ 24

RESULTS ............................. 34

Inter-Judge Reliability ................... 34

Internal Consistency Reliability............... 50

Correlation Matrix ...................... 53

Test-Retest Reliability ................... 55

DISCUSSION............................ 57

Evaluating the Evaluators .................. 57

Evaluating the Test ..................... 60

Evaluating the Test (The Second Time Around) ......... 62

Conclusions ......................... 63

APPENDICES............................ 66

REFERENCES............................ 94

LIST OF TABLES

Table                                                          Page

 1.  Intercorrelations for work sample measures
     (Campion, 1972) ..................... 9
 2.  Interscorer reliability of subtests (Bornstein
     et al., 1957) ...................... 12
 3.  Test-retest reliability (Bornstein et al., 1957) ..... 14
 4.  Subtest intercorrelations (Bornstein et al., 1957) .... 15
 5.  Intercorrelations among trouble-shooting subareas
     (Siegel and Jensen, 1955) ................ 18
 6.  Inter-judge reliability .................. 28
 7.  Internal consistency (coefficient alpha) ......... 31
 8.  Contents of correlation matrix .............. 32
 9.  Inter-judge reliability of tolerance scores--raw
     measurements (computed on first 68 subjects) ...... 35
10.  Inter-judge reliability of tolerance scores--scored
     measurements (computed on first 68 subjects) ...... 36
11.  Inter-judge reliability of tolerance scores by
     machine--raw measurements ................ 38
12.  Inter-judge reliability of tolerance scores by
     machine--scored measurements .............. 38
13.  Average inter-judge reliability of each tolerance
     dimension (corrected by Spearman-Brown formula for
     three judges) ...................... 39
14.  Inter-judge reliability for finish dimensions ....... 41
15.  Comparison of inter-judge reliabilities for tolerance
     and finish ....................... 43
16a. Means and standard deviations of raw measurements (odds) . 44
16b. Means and standard deviations of raw measurements (evens). 45
17a. Means and standard deviations of scored measurements
     (odds) ......................... 46
17b. Means and standard deviations of scored measurements
     (evens) ......................... 47
18.  Coefficient alpha by judge for tolerance scores ...... 49
19.  Correlations of absolute deviations of scored
     measurements with time ................. 51
20.  Internal consistency reliability ............. 52
21.  Intercorrelations of scores within the performance test . 54
22a. Test-retest reliability by machine ............ 55
22b. Test-retest reliability by dimension ........... 56

LIST OF APPENDICES

Appendix                                                       Page

A. TRANSCRIPTS OF TAPED INSTRUCTIONS AND
   BLUEPRINTS OF TASKS ................... 66

   Transcripts of Taped Instructions
      Horizontal Mill Instructions ............ 66
      Vertical Mill Instructions ............. 67
      Drill Press Instructions .............. 68
      Lathe Instructions ................. 69
      Surface Grinder Instructions ............ 70

   Blueprints of Tasks
      Horizontal and Vertical Mill Tasks ......... 71
      Drill Press Task .................. 72
      Lathe Task II .................... 73
      Surface Grinder Task ................ 74

B. TOLERANCE EVALUATION SHEETS, FINISH EVALUATIONS,
   AND EXPLANATION OF DIMENSION NUMBERS .......... 75

   Tolerance Evaluation Sheets
      Horizontal Mill Task ................ 75
      Vertical Mill Task--Milling a Pocket ........ 77
      Drill Press Task .................. 79
      Lathe Task II--Boring, Facing, and Chamfering .... 81

   Finish Evaluations
      Horizontal and Vertical Mill Tasks ......... 83
      Drill Press Task .................. 84
      Lathe Task ..................... 85

   Explanation of Dimension Numbers ........... 87

C. EXPLANATION OF SCORING SYSTEM AND 2 POINT SYSTEM ..... 89

   Explanation of Scoring System ............. 89
   2 Point System .................... 91

INTRODUCTION

McClelland (1973), in a critique of the testing movement in the

United States, suggests that there is an overreliance on intelligence

or aptitude testing, and that the evidence of validity of these kinds

of tests does not support the widespread use that these tests have been

receiving. This position reflects the movement in the United States

which advocates less emphasis on tests of ability and more emphasis on

tests of achievement. McClelland suggests a number of ways to improve

this situation of overreliance on intelligence or aptitude testing.

His first suggestion is that the best kind of testing is criterion

sampling. He states that there is ample evidence to show that tests

which sample job skills will predict proficiency on the job and sug-

gests that there should be less reliance on paper and pencil tests

which tap a general intelligence factor or other "unrelated" abilities.

Wernimont and Campbell (1968) propose an alternative to the

"classic validity model," which results in low validities and misapplications.

Their alternative is the "behavioral consistency" approach

(also advocated by Campion, 1972) which is based on the idea that the

best indicator of future performance is past performance. The classic

validity model uses tests as "signs" to predict, and it is suggested

that the use of "samples" of behavior would work better. Part of this

behavioral consistency approach involves the use of work samples or

job samples. According to Wernimont and Campbell the consistency ap-

proach would reduce or eliminate problems associated with faking and

response sets, discrimination in testing, and invasion of privacy.

Job sample testing falls into the general category of performance

testing. According to Adkins, et al. (1947) a performance

test is one in which the subject is directed to carry out some activity.

There are two kinds of performance tests. The first is

aptitude testing, which is not job sampling, according to Adkins,

et al.1 An example of this kind of test would be some sort of form

board test. The second kind of performance test is an achievement

test, which is a job sample test. The authors, in 1947, were not very

enthusiastic in recommending the use of performance tests. They sug-

gest that these tests may be less valid than written tests.

1The exclusion of work samples from aptitude testing may not be

totally accurate. See, for example, the report on Job Trials of the

Jewish Employment and Vocational Service (1973).

Twenty-five years later times have changed, and while there may

not be widespread use of performance testing, the idea of this kind of

testing seems to be one which has gained much popularity among psychologists,

especially those in the personnel area. "Today it is generally

conceded that written tests of trade knowledge are not a very dependable

way to evaluate shop performance and that without some type of

direct or indirect measure of actual performance it is unlikely that

we can make an accurate assessment of an individual's trade competence"

(Boyd and Shimberg, 1971).

O'Leary (1973) points out that predictive or concurrent validity

is not ideally the only requirement for fair selection. He suggests

that content validity should be emphasized. This is especially

important in light of Title Seven of the 1964 Civil Rights Act.

O'Leary suggests that the more nearly the test duplicates the specific

tasks to be performed on the job, the greater the chances are of devel-

oping selection devices that are fair. He suggests that job sample

testing be used to meet this standard. In addition to the 1964 Civil

Rights Act, the emphasis on content validity becomes important to employers

who are bound by the guidelines of EEOC (1970), OFCC (1971),

and recent court decisions, especially the Supreme Court case of

Griggs vs. Duke Power (1970). The goal of fair employment practices

seems to be the most compelling reason for the growing importance of

job sample testing.

O'Leary's article has not gone without criticism. Gael (1974)

and Blood (1974) criticize O'Leary for not backing up his statements

with empirical evidence. Cole (1974) points out that the issue of

content of the predictors used is a matter of social values and public

policy, and not a question of psychometrics as presented by O'Leary.

These critics, generally speaking, were not so critical of job sample

testing, but were critical of the way O'Leary presented the idea.

A system of performance testing has been successfully used by

the New Jersey Civil Service Commission in the selection and training

of people in various trades, such as carpenters, brick layers, auto

mechanics, and truck drivers (Scheuer, 1970). Some advantages of per-

formance testing are claimed in the article:

1) they are the simplest and most economical type of test to

prepare and administer for trades and related positions;

2) they yield results that are more reliable than those of

written or oral tests;

3) a large variety of items are available just by selecting

them from projects in the various trades;

4) there is a much smaller likelihood of failing the candidate

who can do the job, but who lacks the verbal ability to

explain what is to be done;

5) oral or written tests "turn off" the people that need to be

hired--their reaction to performance tests is much more

positive.

The last point is also supported by Steel et al. (1945), who present

evidence showing that people like work sample tests better than other

kinds of testing, although they do not necessarily find it to be

easier than a dexterity test, which is what the work sample test was

compared to.

Scheuer's article is written in a non-scientific manner with

little empirical evidence presented to back up his arguments. Scheuer

states that performance tests are more reliable than written tests, yet

he has no empirical evidence to back this up. He states that perform-

ance tests are the most simple and economical type of test, yet it has

been complexity and prohibitive costs that have been major reasons for

such tests not being used more in the past. It is interesting and

important to be aware that a system of performance testing has been

successfully employed in New Jersey; however, a more scientific ac-

counting of its success could clarify some of the points that Scheuer

raises.

O'Leary (1973) states that job sample testing "helps the com-

pany to learn something important about the applicant's suitability for

the job, and it enables the applicant to learn something important

about the job's suitability for him." This type of mutually beneficial

situation also characterizes use of job samples in counseling disadvan-

taged people. The Jewish Employment and Vocational Service of Phila-

delphia (1968) has studied the use of a work sample program in coun-

seling and training disadvantaged applicants and has found the program

to help counselors and counselees in a number of ways. Spergel and

Leshner (1968) in writing about this program state that the major

virtue of the work sample approach is that it is reality oriented.

Gordon (1969) reviews how different agencies have used work sample

testing, mainly in counseling disadvantaged youth, and describes sev-

eral reasons why he thinks the work sample technique is an extremely

valuable one:

1) It is non-verbal.

2) It has high content validity.2

3) Obvious relevance of the test appears more sensible and
   therefore acceptable to test-suspicious youth who are thus
   likely to be well motivated to perform on it--more motivated
   than in other testing situations.

4) It is likely to be a better predictor of job performance.

5) It makes such apparent good sense that it may be more
   attractive to employers.

6) It allows for self-assessment of vocational skills and
   an opportunity to discover self-interests; scores are
   easier to understand than paper and pencil test scores.

7) Work sample testing provides a firm reality on which to
   base self-assessments and provides a concrete base for
   disadvantaged people's image of occupations and work
   careers.

8) It is possibly less of a device for predicting success
   and more of a device for producing success.

2This will only be true if the test is constructed very carefully.

A number of authors have addressed themselves to the issue of

reliability in performance testing. As early as 1945, McPherson stated

that more research into reliability of performance tests is needed.

Part of Adkins et al.'s (1947) lack of enthusiasm in recommending the

use of performance tests was due to two examples that they cite which

show that work-samples were not scored reliably; that is, interrater

reliability was low. Shimberg et al. (1972) state that the most serious

shortcoming of performance tests used in occupational licensing exami-

nations is

the lack of adequate criteria or standards for evaluating

performance. Raters need clear and specific directions as

to what they are to look for, what constitutes acceptable

performance on a given task, and how much credit should be

deducted for failure to satisfy the criteria in specified

ways. Without guidelines, each rater is forced to use

subjective measures that are based on his own experience

and standards.

The previous paragraph is addressed to the issue of scoring

reliability--how well judges, raters, evaluators, observers agree in

scoring the parts of a performance test. Another kind of reliability

in performance testing that is important to look at is test-retest

reliability. Wernimont and Campbell (1968) cite a review of the

literature on criterion theory by Ronan and Prien (1966), who come up

with the conclusion that, with the present available data, the ques-

tion of whether job performance is reliable cannot be answered. They

found very few studies that actually used the same criterion measure

to assess performance at two or more points in time.

In the absence of much knowledge concerning the stability

of relevant job behaviors it seems a bit dangerous to

apply the classic validation model and attempt to generalize

from a one-time criterion measure to an appreciable

time span of job behavior. Utilizing the consistency

notion confronts the problem directly and forces a

consideration of what job behaviors are recurring contributors

to effective performance (and therefore predictable)

and which are not (Wernimont and Campbell,

1968).

Most studies that deal with the issue of reliability in performance

testing deal with the interjudge kind of reliability. There is a

scarcity of literature on test-retest or internal-consistency reliability,

although some studies report intercorrelations of parts (subtests)

of a performance test. The purpose of this paper is to examine

issues in the reliability of a performance test. Before this study is

described in detail, a review of the literature which reports reliabil-

ities of performance tests will be presented.

Campion (1972) advocates the behavioral consistency approach of

Wernimont and Campbell. However, he writes that the lack of guidelines

for sampling behaviors seems to be a major obstacle to wider use of the

consistency approach. Using job experts, Campion came up with his work

sample test for maintenance mechanics. There were four parts of his

work sample test. These parts and their intercorrelations are pre-

sented in Table 1. Subjects were 34 maintenance mechanics. Besides

the work sample data, various paper and pencil aptitude tests were

administered, and foremen's evaluations of subjects were collected.

The work sample test was significantly related to the foremen's evaluations;

the paper and pencil tests were not. The method of evaluation

used in the work sample test was a check list of behaviors to look for

given to an observer. No mention was made in the article about any

evaluation of end products, nor were any reliability data given, although

coefficient alpha, computed from the data in Table 1, is .40.

TABLE 1.--Intercorrelations for work sample measures (Campion, 1972).

                                                  Intercorrelations
Part                                             B     C     D    Total

A) Installing pulleys and belts                 .25   .01   .16    .63
B) Disassembling and repairing a gear box             .11   .27    .64
C) Installing and repairing a motor                         .07    .42
D) Pressing a bushing into sprocket and
   reaming to fit a shaft                                          .70
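The alpha of .40 quoted for Campion's data appears consistent with the standardized form of coefficient alpha, which depends only on the number of parts k and the mean of their intercorrelations. The sketch below illustrates that calculation; it assumes the standardized formula was the one used (the text does not say), and the function name is hypothetical. The six input values are the intercorrelations among parts A through D in Table 1.

```python
def standardized_alpha(intercorrelations, k):
    """Standardized coefficient alpha from the pairwise intercorrelations of k parts.

    Assumes equally weighted, standardized parts: alpha = k*r / (1 + (k - 1)*r),
    where r is the mean of the k*(k-1)/2 off-diagonal correlations.
    """
    expected_pairs = k * (k - 1) // 2
    if len(intercorrelations) != expected_pairs:
        raise ValueError(f"expected {expected_pairs} correlations for k={k}")
    mean_r = sum(intercorrelations) / len(intercorrelations)
    return k * mean_r / (1 + (k - 1) * mean_r)


# Off-diagonal intercorrelations among the four parts in Table 1:
# A-B, A-C, A-D, B-C, B-D, C-D
campion_r = [0.25, 0.01, 0.16, 0.11, 0.27, 0.07]
campion_alpha = standardized_alpha(campion_r, k=4)
print(f"{campion_alpha:.2f}")  # 0.40
```

The mean intercorrelation here is .145, so four parts give an alpha of about .40. Applied to the ten intercorrelations among the five trouble-shooting subareas in Table 5, the same formula gives about .59, which matches the value quoted for that table.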

One of the most thoroughly done research projects in the area of

performance testing was that of Bornstein et al. (1957). The authors

did research on the Basic Military Performance Test (BMPT), a work

sample type of measure of achievement in basic training in the U.S.

Army. The BMPT consists of 13 subtests, administered at 13 individual

stations, and requires 16 men for administration. These men observe

and record behaviors.

Bornstein's study is particularly significant because he looks

at reliability fairly extensively. Few studies in the literature have

paid close attention to reliability of performance measures.

Bornstein looks at test-retest reliability and scoring (interobserver)

reliability.

Performance items on the BMPT were classified into tangible

items (end products) and intangible items (observations). Two ob-

servers scored 43 tangible items and 57 intangible items on a pass-fail

basis. Phi-coefficients were computed for each item as a means of

evaluating interscorer agreement. The mean phi-coefficient for the

intangible items was .611 (S.D. = .227), and for the tangible items it

was .776 (S.D. = .187). An alternate way of reporting interscorer

agreement is reported elsewhere (Bornstein et al., 1954). That way is

the percentage agreement between scorers.3 For tangible items it was

93% (S.D. = .07), and for intangible items it was 87% (S.D. = .10).

The authors conclude that intangible items can be used with only a

slight loss of reliability as compared with tangible items, but the

added validity (from increasing test length with items that are highly

content valid) should more than compensate for the reduction in scoring

agreement.

3The percentage agreement method of reporting interscorer

agreement cannot be interpreted in the same way as an interscorer re-

liability coefficient. The reliability coefficient will be high when

two observers check the same number of items for each observee, regard-

less of whether there is any overlap in the particular items checked.

In this case the percentage agreement would be low. However, a serious

shortcoming of the percentage agreement method is that items with no

variance will inflate the percentage agreement reported. The greater

the number of items with no variance, the higher the percentage agreement.
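The contrast between percentage agreement and a correlational index can be made concrete for a single pass-fail item. The following is a minimal sketch with made-up scores, not data from Bornstein et al.; the function names are hypothetical. It computes both indices for an item that nearly every examinee passes, and for an item with no variance at all.

```python
from math import sqrt


def percent_agreement(scorer_a, scorer_b):
    """Proportion of examinees given the same pass (1) / fail (0) score by both scorers."""
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return matches / len(scorer_a)


def phi_coefficient(scorer_a, scorer_b):
    """Phi coefficient for two pass/fail score vectors; None if either has no variance."""
    pairs = list(zip(scorer_a, scorer_b))
    both_pass = sum(a == 1 and b == 1 for a, b in pairs)
    only_a = sum(a == 1 and b == 0 for a, b in pairs)
    only_b = sum(a == 0 and b == 1 for a, b in pairs)
    both_fail = sum(a == 0 and b == 0 for a, b in pairs)
    denominator = sqrt(
        (both_pass + only_a) * (only_b + both_fail)
        * (both_pass + only_b) * (only_a + both_fail)
    )
    if denominator == 0:
        return None  # phi is undefined when a scorer gives everyone the same score
    return (both_pass * both_fail - only_a * only_b) / denominator


# Hypothetical item that almost everyone passes; the scorers disagree on two examinees.
scorer_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
scorer_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
print(percent_agreement(scorer_a, scorer_b))  # 0.8
print(phi_coefficient(scorer_a, scorer_b))    # slightly negative (-1/9)

# Hypothetical item with no variance: perfect agreement, but phi cannot be computed.
print(percent_agreement([1] * 5, [1] * 5), phi_coefficient([1] * 5, [1] * 5))
```

In the first case the scorers agree on 80% of examinees even though they never agree on who failed, so phi is slightly negative. In the second case agreement is 100% although the item carries no information about scorer consistency, which is the inflation the footnote describes.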

Correlation coefficients between scores assigned by the two

different scorers on the 13 subtests ranged from .45 to .95, with a

mean interscorer reliability of .78. Table 2 reports the number of

items contained in each subtest, the number of examinees tested, the

difference between each observer's mean score for each subtest, dif-

ference between standard deviations, and the interobserver correlation.

The authors suggest that the reliability of the test may be too low

for individual diagnosis (or at least limits the use of the test for

this purpose), but it still can have its value in use for group pre-

diction.

Interscorer reliability for the total test was estimated to be

.78, the mean of the subtest reliabilities. Total test reliability

could not be computed directly because the same two observers were not

consistently used for the 13 subtests. The authors present a rationale

for using .78 as a conservative estimate of interscorer reliability of

the total test.4

TABLE 2.--Interscorer reliability of subtests (Bornstein et al., 1957).

Station or   No. of   Examinees   Diff. Between   Diff. Between
Subtest      Items    Tested      Means           S.D.'s            r

 1             9       166         -.22**           .18*           .87
 2             9       245        -1.16**           .42            .61
 3             7       212          .08             .04            .68
 4             6       178          .05             .19**          .82
 5            10       178          .12**           .23**          .72
 6            10       164          .24**           .10            .87
 7             5       171          .04            -.02            .95
 8             6       177          .09             .08            .94
 9             5       224          .01             .13*           .48
10             6       294         -.01            -.06            .90
11            10       227          .09             .04            .50
12            10       278          .10             .08            .74
13             8       170          .43**          -.36**          .87

 *p < .05
**p < .01

4Interscorer reliability of the total test had to be estimated

because it was not possible to have a complete set of observers at each

of the 13 subtests. The rationale for using .78 as an estimate of

total test reliability was as follows: Fourteen composites of from

3 to 5 subtests with complete sets of observers were examined. The

reliabilities of these composites computed in two different ways were

looked at. One way was computing interscorer reliability of the composite

directly. The other way was calculating the interscorer reliability

separately for each subtest and averaging these reliabilities.

It was found that the composite reliability was somewhat higher than

the average reliability in most comparisons, and so the authors took

the average of the subtest reliabilities as a conservative estimate of

the reliability of the total test. This "finding" was predictable from

classic reliability theory. When different subscales are assumed to

measure different things, and they are uncorrelated, the best estimate

of reliability is the mean of the individual reliabilities. If the

subscales are positively correlated, as was the case in the Bornstein

et al. study, then the above estimate will be a conservative estimate

of the reliability of the total test.
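The classical-test-theory argument for treating the mean subtest reliability as a conservative estimate can be sketched numerically. The code below assumes an unweighted sum of k standardized subtests whose errors are uncorrelated across subtests; under those assumptions the composite reliability is 1 - (1 - mean reliability) / (1 + (k - 1) * mean intercorrelation). The function name is hypothetical. The values .78 and 13 come from the text, while the mean intercorrelations tried are illustrative only.

```python
def composite_reliability(mean_subtest_reliability, k, mean_intercorrelation):
    """Reliability of an unweighted sum of k standardized subtests.

    Assumes equal subtest variances and errors uncorrelated across subtests,
    so error variance is k*(1 - mean reliability) and composite variance is
    k + k*(k - 1)*mean_intercorrelation.
    """
    error_share = (1 - mean_subtest_reliability) / (1 + (k - 1) * mean_intercorrelation)
    return 1 - error_share


# Mean interscorer reliability (.78) and number of subtests (13) are from the text;
# the mean intercorrelations below are hypothetical.
for mean_r in (0.0, 0.1, 0.2):
    estimate = composite_reliability(0.78, 13, mean_r)
    print(f"mean intercorrelation {mean_r:.1f}: composite reliability {estimate:.3f}")
```

With uncorrelated subtests the composite reliability equals the mean subtest reliability. Any positive mean intercorrelation pushes it above the mean, which is why the mean can serve as a lower bound.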

Table 3 presents data on test-retest reliability. Entered in

the table is the subtest, number of examinees taking the subtest, means

and standard deviations for the test and the retest, the differences

between the means and standard deviations, and test-retest reliability.

The mean test-retest reliability of the 13 subtests is .39. The relia-

bility coefficient for the total test is .67.5 As can be seen in the

table, the mean scores went up on the retest. The authors suggest that

learning may have biased the reliability coefficients--without learning,

reliability would have been higher.

Table 4 reports the intercorrelations of the 13 subtests with

each other and with total score. As was the case with Campion's results,

these intercorrelations are low and positive. Coefficient alpha,

computed from the data in Table 4, is .62.

Other results that Bornstein found were that superior and peer

ratings had little or no relationship to the performance test (this is

in disagreement with Campion's findings) and that the BMPT had a lower

correlation with a reading and vocabulary test than did the written

achievement test that was used (.29 vs. .54). This is an indication of

how a verbal factor, which may be unrelated to performance, is present

in written tests and not in performance tests. This verbal factor will

tend to have a negative impact on the test results of disadvantaged

groups.

5The fact that test-retest reliability of the total test is

greater than the mean test-retest reliability of the subtests is to be

expected here. Total test is greater in length than each subtest and

therefore total test reliability will be greater than the average

reliability of the subtests provided that subtests are, on average,

correlated positively.

TABLE 3.--Test-retest reliability (Bornstein et al., 1957).

Station   No. of          Test           Retest       Diff.     Diff.
or        Examinees                                   Between   Between
Subtest   Tested      Mean    S.D.    Mean    S.D.    Means     S.D.'s      r

 1        166         2.56    1.11    2.57    1.04     .01       .07       .19
 2        165         7.10    1.90    7.62    1.65     .52**     .25       .10
 3        166         4.60    1.16    4.71    1.29     .11       .13       .32
 4        163         2.29    1.40    3.27    1.50     .98**     .10       .30
 5        157         3.01     .98    3.48     .77     .47**     .21**     .38
 6        167         7.19    1.66    6.96    1.95     .23       .29       .45
 7        165         3.30    1.20    4.23    1.13     .93**     .07       .22
 8        168         3.70    1.81    4.56    1.70     .86**     .11       .53
 9        169         2.78    1.04    2.96     .85     .18*      .19**     .39
10        164         3.30    1.25    4.11     .63     .81**     .62**     .07
11        169         3.39    1.72    3.53    1.60     .14       .12       .41
12        166         1.89    1.27    3.40    1.31     .51       .04       .07
13        168         4.04    2.13    4.77    2.18     .73**     .05       .51
Total     142        49.74    8.52   56.35    6.92    6.61**   -1.60*      .67

 *p < .05
**p < .01

TABLE 4.--Subtest intercorrelations (Bornstein et al., 1957) (N = 307).

[Intercorrelation matrix not recoverable from the source scan.]

Siegel (1954b) studied intraobserver consistency. He describes

the ideal method for determining the consistency of an individual

examiner as a situation where the examinee's performance is held constant

over two separate occasions, and the observer's perceptions are allowed

to vary. The best way to do this is with a motion picture. In his

study films were made of Naval Aviation Structural Mechanics taking a

Drill Point Grinding Work Sample Performance Test. Films were shown

twice to five observers, with a one month interval between each show-

ing. Observers were given an evaluation form to fill out while viewing

the film. This form contained questions regarding safety and process

used by the examinees in the film. Intraexaminer consistency was de-

termined by dividing the number of items answered exactly in the same

manner on each showing, by total number of items on the evaluation

form. The figures for percent consistency were 64.3%, 71.4%, 85.6%,

92.3%, and 100%, with a mean of 82.8%. Siegel suggests that the range in

intraobserver reliabilities warrants a careful investigation into the

area of intraobserver reliability. It is necessary to have high

intraobserver reliability in order for interobserver reliability to

be high.

McPherson (1945) constructed a work sample test for the wood

shop. She looked at intraobserver consistency of end product measurements,

in contrast to Siegel's process evaluations. End products

were measured at two different times by one psychometrician. She found

intraobserver reliability to be in the area of .97.

Siegel and Jensen (1955) developed a job sample trouble-

shooting performance test for aviation electricians.6 The test con-

tained five subareas, in which the electricians (N = 137) had to

identify certain problems in the functioning electrical mechanisms.

The authors found the split-half reliabilities of the subareas, cor-

rected by Spearman-Brown, to be .90, .59, .72, .84, and .64, and for

the composite it was .86. Intercorrelations between areas of the

performance test are presented in Table 5. Coefficient alpha, com-

puted from these intercorrelations, is .59. The authors report these

intercorrelations to be relatively low, with the exception of I and

III; however, compared to the results of Bornstein (1957) and Campion

(1972) they were not so low. The validity of the test, according to

the authors, proved to be substantial--the more experienced elec-

tricians scored higher.

6The term "job sample" may be inappropriate for the test de-

scribed in this paragraph. Although it is unclear from the article,

it seems that the test was a written description of a situation that

may arise on the job, rather than the situation itself.

TABLE 5.--Intercorrelations among trouble-shooting subareas (Siegel
and Jensen, 1955).

Subtest     I      IIa     IIb     III

IIa        .16
IIb        .32     .20
III        .50     .23     .34
IV         .16     .06     .11     .17
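The coefficient alpha of .59 quoted for the Siegel and Jensen subareas can be checked against Table 5. The sketch below assumes the standardized form of alpha, which depends only on the number of parts and their mean intercorrelation; the text does not say which formula was used, but this form reproduces the reported value.

```python
def standardized_alpha(correlations, k):
    """Standardized coefficient alpha from the mean intercorrelation of k parts."""
    mean_r = sum(correlations) / len(correlations)
    return k * mean_r / (1 + (k - 1) * mean_r)

# The ten intercorrelations among the five subareas in Table 5.
table_5 = [.16, .32, .20, .50, .23, .34, .16, .06, .11, .17]

alpha = standardized_alpha(table_5, k=5)
print(round(alpha, 2))  # 0.59
```

The mean intercorrelation is .225, so alpha = 5(.225) / (1 + 4(.225)) = 1.125 / 1.9.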

Siegel (1955) compared the scoring of tangible and intangible

items of the aviation structural mechanics tests, an individually ad-

ministered performance test. A check-list method was used to score the

intangible items. Interobserver consistency within a test was calcu-

lated using the percent consistency method of Siegel (1954b). Siegel

found that tangible and intangible procedures yield about equal con-

sistency between observers. This differs somewhat from Bornstein's

findings that intangible items were scored slightly less reliably than

tangible items. Siegel attributes his findings to objectivity in the

check-list procedure and grossness of the observations called for.

In another study using check-lists to evaluate performance in

job sample tests, Siegel (1954a) concludes that check-lists (used to

evaluate process and unsafe behaviors) of performance are to be pre-

ferred over "clinical" (subjective) appraisal of end products. The

check-list is prepared by analyzing a task into component actions which

a man performs in order to complete a task. Siegel reports inter-

examiner reliability in the .90's. His check-list procedure is a more

objective one than clinical appraisal of end products which involves

making subjective evaluations about the quality of the finished

product. The value of this comparison is somewhat shaky. By using

subjective clinical appraisals of end products Siegel is setting up

a straw dog and knocking it over. A more valid and useful comparison

would have been to use some objective appraisal of end products and

compare this with the objective evaluation of process and unsafe be-

haviors that were used in the study.

Fredriksen (1962) constructed an "in-basket" test for managers.

This kind of test can be used to select or promote people in manage-

ment positions. It is performance in nature because it simulates kinds

of activities that the testee would need to carry out on the job. The

test consisted of 68 categories. Testees were scored by 2 observers--

one observer scored the odd numbered items within each category, and

the other scored the even numbered items within each category.

Fredriksen reports the split-half reliabilities of each category. These

reliabilities, which reflect both internal consistency and interob-

server agreement, ranged from .87 to .00, with a median reliability of

about .40.

A few other studies mention that reliability was looked at, but

present no data. Besnard and Briggs (1967) report a study on develop-

ment of a performance test to evaluate maintenance personnel for the

Air Force's E-4 Fire Control System. They report interobserver

agreement to be high. Robins et al. (1958) report a test to evaluate a

non-commissioned officer's ability to generate the support of subordinates

in getting a job done. Authors report "adequate" reliability (internal

consistency) of the total test, although one of the three subscores was

reported to have less than adequate reliability. Havron (1954) reports

close agreement between observers in a test to assess effectiveness of

infantry rifle squads in the army.

In constructing a performance test, or in choosing which test

to use for some research or industrial purpose, one may be faced with

the issue of whether to use a test in which process is most important

or one where the product is what really counts. Boyd and Shimberg

(1971) write about the importance of process (intangible) vs. product

(tangible) evaluation. They report that in the original planning of

a structural mechanics test, equal weights were to be given to process

and product. However, this was changed in the end because the chief

machinist mates objected, saying that the end product is what is most

important. Schmidt (1974) gives a number of advantages of evaluating

end products over process evaluations:

1) The number of test administrators can be reduced; the nature

of the test may be group administered rather than individ-

ually administered.

2) It may be less difficult to train nonpsychologists (or

people who have had no previous experience with the sub-

ject matter of the test) to evaluate end products than to

observe and record behaviors.

3) Interevaluator agreement may be higher when end products

are what are being evaluated.

4) Examinees may feel less threatened or nervous when they are

not constantly being watched.

5) Evaluation of end products can take place after the test,

at the convenience of the evaluator.

6) The resulting scores should be more valid, since it is the

ability to produce high quality finished products which is

important in real life; the method of how they were pro-

duced is merely a means to this end and is therefore of

secondary concern.

This study deals with reliability of a job sample test con-

structed for one of the skilled trades--that of machinist. End

products, rather than process, are what were scored. The type of

product that a machinist produces can be evaluated on two dimensions.

One is the accuracy of the actual physical dimensions. This calls on

the evaluator to make physical measurements, usually using precision

measuring instruments. The other dimension is quality of finish, or

how smooth or rough the end product turned out to be.

Pertaining to the finish dimension, Tiffin and Rogers (1941)

report a study in which 150 judges coded (evaluated for finish) 150

sheets of tin. The sheets contained acceptable sheets and sheets with

four different kinds of defects. The sheets were presented in random

order to the inspectors, who identified the sheet as acceptable, or

called out the kind of defect it contained. Reliabilities were

computed on each category and were reported to be between .68 and .90.

These reliabilities represent the extent to which repeated or duplicate

measurements of each inspector by means of this "coded stack test"

would result in the same score for the defects in question for each

inspector. Large variances in time taken to inspect the sheets were

also reported. While .90 is fairly reliable, .68 is not so desirable.

Furthermore, the variation in time needed to inspect leaves room for

improvement in the inspection process.

Tiffin and McCormick (1965) point out the well-known fact that

judgments that are relative, rather than absolute, are more accurate.

They suggest the use of "limit samples" to evaluate finish. Limit

samples are samples of work pieces that are just barely acceptable

enough to fall into a certain category. These limit samples have the

maximum amount or degree of defects that would be allowed for a par-

ticular category. The inspector is to compare the pieces to be in-

spected with the limit samples and make his judgments accordingly.

Such a comparison usually results in a judgment more adequate than

when an inspector relies on a "memory image" of the degree of a defect

that is acceptable vs. not acceptable. Kelly (1955) found in a study

of inspection of glass panels that untrained subjects were able to make

consistent distinctions between the pieces of glass, when a procedure

using relative judgments was employed.

Stuit (1947) reports on interjudge agreement in grading end

products. Four judges evaluated 30 "samplers" prepared by students

in a basic machinists course, using the usual method of combination

squares. Reliabilities ranged from -.11 to .55. A set of taper gauges

and caliper gauges was devised, and two judges evaluated two more sets

of samplers yielding reliabilities of .93 and .96. This exemplifies

the importance of using the correct, and most accurate, measuring

devices if scoring reliability (and retest and internal consistency

reliability, as well, since these reliabilities depend on scoring

reliability) is to be high.

Lawshe and Tiffin (1945) and later Evans (1951) report on the

accuracy of precision measurements in industrial inspection. Their

studies show that accuracy of inspectors is far less than assumed by

most authorities in the field. They also found that measurements made

by apprentices are as accurate as those made by journeymen, and ac-

curacy is unrelated to age, seniority, or experience. Evans reports

that inexperienced people can be trained to use micrometers as accur-

ately as experienced industrial workers. The New Jersey Civil Service

Commission has successfully employed non-tradesmen in administering

their tests and evaluating end products (Scheuer, 1970).

PROCEDURE

The present study was carried out as a part of a project funded

by the U.S. Department of Labor (Schmidt et al., 1974). The project

had two main objectives. The first and most general objective was a

pilot empirical evaluation of a set of innovative procedures for the

construction of valid, reliable, and practical job sample tests in the

skilled trades and technical occupations. The second objective was the

assessment of the relative impact of performance tests and traditional

paper-and-pencil achievement tests on the employment opportunities of

minority and disadvantaged persons (Schmidt, 1972). The present study

focused on reliability of the test which Schmidt et al. developed.

The test was constructed for machinists, and consisted of five

tasks to be carried out on five different machines: vertical mill,

horizontal mill, drill press, surface grinder, and engine lathe. The

test was administered in a machine shop by members of the project's

research staff. Subjects (testees) were primarily apprentices in the

tool and die making, and related trades, who had had at least one year

of experience. Some journeymen also participated. It was possible to

obtain only a small number of machinists, but the tool and die makers

were adequate for the research purposes, since they all had sufficient

experience on the machines tested. Subjects came from various plants

of a large automobile company in Detroit, and three factories in

Chicago that employ people in the trades for whom the test was

constructed.

After a brief introduction about the nature and purpose of the

research project and the performance test, subjects (usually five at a

time) moved out to the testing area (the machine shop). At the admin-

istration table were all the testing materials. Only one administrator

was needed because process behaviors were not being recorded. The

testing materials consisted of pieces of metal stock and blueprints.

At each machine were all tools necessary to carry out the various tasks

and tape recorded instructions explaining what the testee was to do.

The blueprints diagrammed the task to be machined and specified what

the tolerances of the dimensions of the finished product were to be.

Transcripts of the taped instructions and the blueprints can be found

in Appendix A. Each testee was assigned to a station (machine), his

starting and finishing time was recorded, and upon completion of the

task his finished workpiece was turned over to the test administrator.

He was then assigned to a new station, and this process was continued

until all five tasks were completed.

The end products were labeled and taken back to the project

office of Michigan State University to be evaluated. The evaluation

forms can be found in Appendix B. The finished product was evaluated

on two dimensions--finish and tolerances. The evaluation forms

instruct the evaluator on what he or she was to do--where to measure

and what measuring instrument to use in the case of tolerance evalua-

tion, and what to look for or feel for in the case of finish evalua-

tion. In order to decide on the tolerances, the project staff en-

listed the services of a machinist journeyman. The same person also

helped set up the "benchmark" finish evaluation system. This system

was similar to the limit sample described earlier. Benchmarks were

selected from the end products themselves. These benchmarks repre-

sented different categories corresponding to qualities of finish. The

evaluator was to compare the piece to be evaluated with the benchmarks

and decide which category the piece fell in by making a judgment as to

which benchmark the workpiece was closest to.

Project staff members were instructed on evaluation procedures

by the project's machinist consultant. He showed the staff members how

to use the various measuring devices. These devices were a dial

micrometer (theoretical accuracy .0001 inches), dial caliper

(theoretical accuracy .001 inches), a scale, sliding parallels, and a

telescoping gauge. Evaluators were project staff members, and student employees

who were trained, by project staff members, in how to use the various

instruments and evaluate the end products. In light of findings by

Evans (1951) and the experience of the New Jersey Civil Service

Commission (Scheuer, 1970), it was felt that non-tradesmen could

reliably evaluate the end products. This, however, is a matter of

empirical research.

The purpose of this research was not to prove specific research

hypotheses; it was to investigate the thesis that the kind of test de-

scribed above can be constructed to yield high (or at least adequate)

reliabilities (scoring, retest, and internal consistency). Scoring

reliability, or interjudge reliability, assesses how consistent various

evaluations are across judges. In order for a job sample test to be

used by industry, by a government agency, or in a counseling setting,

it must be scored reliably, or individual scores will be meaningless.

Table 6 outlines the interjudge reliabilities that were computed from

the data in this study. Reliabilities for tolerance evaluations,

unless otherwise stated, are computed using a two point scoring system,7

as outlined in Appendix C. The subject received 2 points if his mea-

surement fell within the first tolerance (corresponding to the first

tolerance specified on the evaluation forms in Appendix B), 1 point

within the second tolerance, and 0 points if he did not meet either the

first or second tolerance. The finish evaluations were on a 3, 4, or 5

point system, corresponding to the benchmarks on the evaluation forms

found in Appendix B.

7 Two points was the highest score any subject could receive on

a particular dimension, except in the case of the surface grinder,

which was scored on a three point system.
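The two point tolerance scoring rule described in the preceding paragraph can be sketched as below. The nominal size and tolerance bands are hypothetical (the actual values are on the evaluation forms in Appendix B), and the sketch assumes each tolerance is a symmetric band around the nominal dimension.

```python
def score_dimension(measured, nominal, first_tol, second_tol):
    """Return 2, 1, or 0 points for one measured dimension.

    Assumes symmetric tolerance bands around the nominal size, with
    first_tol the tighter band and second_tol the looser one.
    """
    deviation = abs(measured - nominal)
    if deviation <= first_tol:
        return 2
    if deviation <= second_tol:
        return 1
    return 0

# Hypothetical dimension, in ten-thousandths of an inch (not from Appendix B).
NOMINAL, FIRST_TOL, SECOND_TOL = 10000, 10, 30
measurements = [10004, 9985, 10050]

scores = [score_dimension(m, NOMINAL, FIRST_TOL, SECOND_TOL) for m in measurements]
print(scores)  # [2, 1, 0]
```

Scoring collapses every measurement into one of three values, so two evaluators who disagree by a small amount near a tolerance boundary can still assign different scores.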

TABLE 6.--Inter-judge reliability.

A. Tolerance Scores--r's for first 68 Ss (3 evaluators per subject).a
   1) on raw measurements for all dimensions
   2) on scored measurements for all dimensions
   3) on raw measurements for each of the five machines
   4) on scored measurements for each of the five machines
   5) average interjudge reliability for each dimension corrected
      by Spearman-Brown for 3 judges

B. Finish Scores
   1) for each finish evaluation
   2) for total test

a There are two separate reliability analyses--one for odd numbered
subjects and one for evens.

All finish evaluations were done by four evaluators. Two

people evaluated the odd-numbered subjects and two people evaluated the

even-numbered subjects. Six people evaluated the first 68 subjects'

tolerance dimensions. Three evaluated the odd numbers, and three

evaluated the even numbers. The first 68 subjects were the only ones

whose workpieces were evaluated by three people on tolerance, and

therefore only data from the first 68 subjects were used in computing

inter-judge tolerance reliabilities. The remainder of the subjects

(N = 42) were evaluated by two judges.

There are a number of questions related to inter-judge reliability
that this study addressed itself to. They are as follows:

1) Can non-tradesmen be used to score tests reliably?

2) Can tolerance scores be evaluated more reliably than finish
   scores?

3) Do certain evaluators have constant biases (too high or too
   low) in the way they measure the various tolerance dimensions?

4) Are some evaluators more "sloppy" than others in the way they
   evaluate tolerances?

5) Do measurements become more reliable over time?

A "yes" was hypothesized to be the answer to all five questions.
The following analyses were carried out to test the questions related
to inter-judge reliability:

1) Inter-judge reliabilities were examined: high reliabilities
   (around .90 and above for tolerance, somewhat less for finish)
   are evidence that non-tradesmen can score the test reliably.

2) Mean inter-judge reliabilities on each machine were computed
   for tolerance and finish, and were compared. Mean inter-judge
   reliability on the total test was also computed and compared.
   Fisher's r to z transformations were computed on the reliability
   coefficients and were compared.

3) Eight dimensions were selected to test this hypothesis--four
   of these involved measurements with the calipers and four with
   the micrometer. These dimensions were selected because they
   involved the most straightforward use of micrometer and
   calipers. In order to be considered a "biased" measurer, an
   evaluator had to be consistently high or low in his scored
   measurements with a particular instrument, and this bias had
   to reach a significant level. Matched-pair t-tests were used
   to test the magnitudes of measurements that were consistently
   biased.

4) Three different analyses were carried out to look at
   sloppiness of raters:

   a) Variances of the evaluators were compared using F-tests.

   b) Coefficient alphas were computed on each judge and were
      compared using Fisher's z-transformations.

   c) Interjudge reliabilities were examined in order to
      determine whether any one judge was more sloppy than the
      other two. If a sloppy judge was found, reliabilities
      were to be compared using Fisher's z-transformations to
      test for magnitude of the differences.8

5) Deviations (absolute values of measurement of measurer 1 minus
   measurer 2) were correlated with time and tested for
   statistical significance.
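The matched-pair t-test named in analysis 3 can be sketched as follows. The measurements are hypothetical (two evaluators measuring the same five workpieces, in ten-thousandths of an inch), and the function name is illustrative rather than anything used in the study.

```python
import math
from statistics import mean, stdev

def paired_t(judge_a, judge_b):
    """Matched-pair t statistic and degrees of freedom for two judges'
    measurements of the same workpieces."""
    diffs = [a - b for a, b in zip(judge_a, judge_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical data: judge A reads slightly high on every piece.
judge_a = [1001, 1003, 999, 1002, 1000]
judge_b = [1000, 1001, 998, 1000, 999]

t, df = paired_t(judge_a, judge_b)
print(round(t, 2), df)
```

The statistic is then compared with the critical value of t for the given degrees of freedom (2.776 for df = 4 at the two-tailed .05 level). A consistently positive or negative mean difference that reaches significance is what the text calls a biased measurer.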

Internal consistency reliability assesses the extent to which
the test measures one general factor. It is directly dependent on how
the different parts of the test relate to each other. Besides the
various coefficient alphas, listed in Table 7, a correlation matrix
containing various parts of the test, outlined in Table 8, was
computed.

8 The rationale behind these analyses was as follows:

a) Variance would be at a minimum when there is no sloppiness.
   The measurements of a sloppy evaluator will be more varied
   (his measurement will differ from the true dimension) than
   those of a less sloppy evaluator.

b) No method of significance testing specifically addressed to
   the issue of comparison of two coefficient alphas could be
   found. It was therefore decided to treat coefficient alpha
   as an r², and Fisher's r to z transformations were computed
   on the square root of alpha. These z's were compared.

c) If r12 is higher than r13 and r23, this means that judge 3
   is a sloppier judge (less reliable) than judge 1 or judge 2.
   There will, of course, always be one judge with lower r's,
   so in order for a judge to be considered sloppy, he needed
   to have r's that were consistently lower.

TABLE 7.--Internal consistency (coefficient alpha).a

A. Tolerance Scores computed on first 68 subjects using the median
   measurement.

B. Finish Scores.

C. Tolerance plus finish scores.

a This analysis was done for each machine, for total test containing as
many dimensions as there are measurements, and for total test, using
each machine total as one item.
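Coefficient alpha, the internal consistency index listed in Table 7, can be computed from a subjects-by-items score matrix as in the sketch below. The scores are hypothetical (five subjects, three dimensions scored 0 to 2), and the variable names are illustrative.

```python
from statistics import variance

def coefficient_alpha(score_rows):
    """Coefficient alpha for a matrix with one row per subject and one
    column per item (here, one column per scored dimension)."""
    k = len(score_rows[0])
    item_columns = list(zip(*score_rows))
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in score_rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical scored measurements, not data from this study.
scores = [
    [2, 2, 1],
    [1, 1, 0],
    [2, 1, 2],
    [0, 0, 0],
    [1, 2, 1],
]

alpha = coefficient_alpha(scores)
print(round(alpha, 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as the degree to which the parts of the test measure one general factor.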

Due to practical considerations of the project from which this
study came, an adequate assessment of the test-retest reliability could
not be carried out. Retest subjects were volunteers who had responded
positively to a letter sent to them at the end of the testing period
for the first 68 subjects. Because of this, the time interval between
the first and second testing was too long. The optimal time period
would be from one to two weeks. Any period longer than that would have
a negative effect on test-retest reliability. This is because there
was no standardization in the apprenticeship program, resulting in
different amounts of training and practice on the machines for
different people. The retest reliability coefficients computed in this
study9 should not be taken to represent what the true test-retest
reliability of this performance test is. Instead they represent the
extent to which subjects' scores are stable over a period of time which
allowed the testees to have different amounts of training and practice
on the different machines.

9 Test-retest reliability was computed on tolerance and finish
for each dimension, for each machine, and for total.

TABLE 8.--Contents of correlation matrix.a

A. Performance Test
   1) Total tolerance plus finish score
   2) Total tolerance score
   3) Total finish score

B. Tolerance Plus Finish Scores for:
   1) Horizontal Mill
   2) Vertical Mill
   3) Drill Press
   4) Lathe
   5) Surface Grinder

C. Tolerance Scores for:
   1) Horizontal Mill
   2) Vertical Mill
   3) Drill Press
   4) Lathe
   5) Surface Grinder

D. Finish Scores for:
   1) Horizontal Mill
   2) Vertical Mill
   3) Drill Press
   4) Lathe
   5) Surface Grinder

a This can be done for each dimension, for each machine, and for total.

RESULTS

Inter-Judge Reliability

Tolerance scores are reliable: The data on inter-judge relia-

bility are presented in Tables 9-14. With only a few exceptions,

inter-judge reliability was very high. Table 9 presents the inter-

judge correlations between the three judges who evaluated the odd-

numbered subjects, and between the three judges who evaluated the

even-numbered subjects. These correlation coefficients were computed

on the raw data (that is, on unscored measurements). These correla-

tion coefficients are affected by extreme measurements (measurements

that are far from specified tolerances) in a positive direction, and

therefore are, in most cases, slightly higher than the correlation

coefficients found in Table 10, which presents inter-judge reliabil-

ities computed on the scored measurements. The inter-judge relia-

bilities on the scored measurements are what need to be examined to

assess the extent of interevaluator agreement, because they are not

affected by extreme deviations from tolerance, or by extreme disagree-

ment between evaluators on only one or two measurements, in the way

that the raw score inter-judge reliabilities are. Furthermore, raw

TABLE 9.--Inter-judge reliability of tolerance scores--Raw measurements
(computed on first 68 subjects).

                               Odds                 Evens
                            Evaluators            Evaluators
Machine          Dim.*   1 & 6  1 & 7  6 & 7  2 & 4  2 & 5  4 & 5

Horizontal Mill  1         .99   1.00   1.00    .99    .99    .98
                 2         .99   1.00   1.00    .99    .98    .96
                 3        1.00    .99    .99   1.00   1.00   1.00
                 4        1.00   1.00   1.00   1.00   1.00   1.00
                 5         .96    .99    .94   1.00   1.00   1.00
                 6         .98    .98    .97   1.00   1.00   1.00

Vertical Mill    1        1.00    .86    .86   1.00   1.00   1.00
                 2        1.00    .87    .87   1.00   1.00   1.00
                 3         .97    .98    .96    .99    .99   1.00
                 4         .99    .99    .99    .99    .98    .98
                 5         .72    .98    .69    .99    .99    .99
                 6         .93    .92    .99    .97    .49    .39
                 7         .92    .95    .95    .99    .92    .92

Drill Press      1         .98    .99    .98   1.00   1.00   1.00
                 2        1.00   1.00   1.00    .98    .99    .98
                 3        1.00   1.00    .99    .97    .99    .98
                 4        1.00   1.00    .99   1.00   1.00   1.00
                 5         .99   1.00    .99   1.00   1.00   1.00
                 6         .99    .98    .98    .96    .93    .91

Lathe            1         .97    .99    .97   1.00   1.00   1.00
                 2         .95    .96   1.00   1.00   1.00   1.00
                 3         .95    .95    .99    .99   1.00    .99
                 4         .76    .55    .49    .87    .95    .93
                 5         .75    .90    .86    .93    .93    .95

Surface Grinder  1         .97    .99    .97    .99    .99    .99
                 2         .99    .97    .97    .93    .99    .92
                 3         .99    .99    .99    .73    .93    .86
                 4         .97    .99    .97    .83    .93    .96

*For a description of what tasks the dimension numbers in this column
represent, see Appendix B.

TABLE 10.--Inter-judge reliability of tolerance scores--Scored
measurements (computed on first 68 subjects).

                               Odds                 Evens
                            Evaluators            Evaluators
Machine          Dim.    1 & 6  1 & 7  6 & 7  2 & 4  2 & 5  4 & 5

Horizontal Mill  1         .92    .94    .90    .90    .96    .90
                 2         .96    .98    .94    .85    .87    .73
                 3         .84    .79    .78    .84    .77    .89
                 4         .87    .89    .84    .86    .83    .83
                 5         .81    .76    .71    .87    .90    .92
                 6         .90    .72    .68    .90    .88    .98

Vertical Mill    1         .87    .89    .88    .94    .91    .89
                 2         .84    .84    .89    .90    .92    .94
                 3         .82    .85    .72   1.00    .98    .98
                 4         .77    .82    .72    .88    .76    .88
                 5         .89    .89    .88   1.00   1.00   1.00
                 6         .82    .76    .82    .84    .94    .77
                 7         .85    .89    .82    .95    .93    .95

Drill Press      1         .72    .77    .94    .63    .87    .75
                 2         .94    .92    .98    .85    .95    .89
                 3         .83    .88    .85    .81    .88    .83
                 4         .80    .89    .86    .98    .98   1.00
                 5         .91    .93    .98    .96    .96    .95
                 6         .98    .96    .98    .90    .92    .88

Lathe            1         .66    .71    .52    .82    .84    .79
                 2         .77    .77    .84    .78    .93    .84
                 3         .67    .77    .77    .74    .93    .81
                 4        -.29   -.13    .22    .33    .63    .69
                 5        -.21   -.38    .24    .58    .40    .50

Surface Grinder  1
                 2
                 3         .86    .91    .90    .94    .88    .88
                 4

scores cannot be used as performance test scores because they do not

reflect "skill," or how close the testee's measurements come to spe-

cified tolerances.
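The point that raw-measurement correlations are pushed upward by extreme measurements can be illustrated with a small sketch. The readings are hypothetical: two judges who agree only weakly on five typical workpieces, followed by one workpiece that is far out of tolerance and that both judges read as such.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two lists of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical raw readings (ten-thousandths of an inch), not study data.
judge_a = [1000, 1002, 998, 1001, 999]
judge_b = [1001, 999, 1000, 1002, 998]

r_typical = pearson_r(judge_a, judge_b)
# Add one workpiece far from the specified size, read similarly by both judges.
r_with_extreme = pearson_r(judge_a + [1100], judge_b + [1098])

print(round(r_typical, 2), round(r_with_extreme, 2))
```

A single extreme workpiece dominates the variance and raises the raw correlation sharply, even though agreement on the typical pieces is unchanged. Scoring against tolerances removes this effect, since the extreme piece simply receives 0 points from both judges.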

Table 11 presents inter-judge correlations by machine for the

three odd and three even evaluators on raw measurements. Table 12 is

the analogue of Table 11 for scored measurements. These tables reveal

that inter-judge reliability is very high (in the .90's) for all the

machines except the lathe. Table 13, which presents the average inter-

judge reliability for each measurement, corrected by Spearman-Brown for

three evaluators, also bears out this finding. As can be seen in

Table 13, all the tasks have reliability coefficients in the .90's

except for the lathe. Furthermore, the evaluators of the odd-numbered

subjects had lower agreement than the evaluators of the even-numbered

subjects. This was tested by the Wilcoxon matched-pairs signed-rank

test (p < .01 for a two tailed test). Although all of the lathe mea-

surements produced agreements that were somewhat less than agreement

on measurements of other machines, measurements number four and five

were far below. These two measurements were the inside and outside

chamfers. The process of measuring these dimensions required that the

evaluator make a very fine measurement using a scale, which is a non-

precision instrument. One other measurement required the use of a

scale--the diameter of the countersink on the drill press task (drill

press measurement number 6). The reliabilities of this measurement

TABLE 11.--Inter-judge reliability of tolerance scores by machine--raw
measurements.

                         Odds                 Evens
                      Evaluators            Evaluators
                   1 & 6  1 & 7  6 & 7  2 & 4  2 & 5  4 & 5

Horizontal Mill      .99    .99    .98   1.00   1.00    .99
Vertical Mill        .97    .96    .94    .99    .95    .94
Drill Press          .99    .99    .99    .99    .99    .99
Lathe                .84    .83    .88    .96    .98    .97
Surface Grinder      .99   1.00    .99    .90    .97    .96

TABLE 12.--Inter-judge reliability of tolerance scores by machine--
scored measurements.

                         Odds                 Evens
                      Evaluators            Evaluators
                   1 & 6  1 & 7  6 & 7  2 & 4  2 & 5  4 & 5

Horizontal Mill      .93    .93    .91    .93    .92    .96
Vertical Mill        .92    .93    .94    .98    .97    .96
Drill Press          .93    .94    .97    .92    .97    .95
Lathe                .45    .45    .59    .78    .85    .85
Surface Grinder      .95    .95    .95    .97    .97    .97

TABLE 13.--Average inter-judge reliability of each tolerance dimension
(corrected by Spearman-Brown formula for three judges).

Machine          Dimension    Odds   Evens

Horizontal Mill  1             .97     .97
                 2             .99     .93
                 3             .92     .94
                 4             .95     .94
                 5             .91     .96
                 6             .91     .97

Vertical Mill    1             .96     .97
                 2             .95     .97
                 3             .92    1.00
                 4             .91     .94
                 5             .96    1.00
                 6             .92     .94
                 7             .95     .98

Drill Press      1             .97     .90
                 2             .98     .96
                 3             .95     .94
                 4             .94     .99
                 5             .98     .98
                 6             .99     .96

Lathe            1             .84     .93
                 2             .92     .94
                 3             .89     .93
                 4            -.23     .79
                 5            -.45     .74

Surface Grinder

were higher than those of the chamfers. This can be explained by two

factors. One is that the measurement of the countersink dimension can

be read much more easily than the chamfers. The scale is simply rested flat

on the workpiece. With the chamfers, it is not so easy. The scale has

to be held on an angle and is therefore subject to errors caused by

unsteadiness of the evaluator's hand and by placing the scale on the

workpiece at the wrong angle. The second, and most important, factor

is that the tolerances for the chamfer were more stringent. This has

a great effect on the reliability of the scored measurements. The

inter-judge reliabilities on the raw measurements for the chamfers

(Table 9) were much higher than those for the scored measurements

(Table 10).
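The Spearman-Brown correction applied in Tables 13 and 14 can be sketched as below. The formula is the standard one for the reliability of a composite of n judges; the inputs are values reported in Tables 10 and 14, and the function name is illustrative.

```python
def spearman_brown(r, n_judges):
    """Reliability of the pooled judgment of n_judges, given the
    average reliability r of a single judge."""
    return n_judges * r / (1 + (n_judges - 1) * r)

# Horizontal mill, dimension 1, odd-numbered subjects: the three
# pairwise correlations from Table 10.
pairwise = [.92, .94, .90]
mean_r = sum(pairwise) / len(pairwise)
print(round(spearman_brown(mean_r, 3), 2))  # 0.97, the Table 13 entry

# Horizontal mill finish, judges 3 & 8: uncorrected r from Table 14.
print(round(spearman_brown(.62, 2), 2))     # 0.77, the value in parentheses
```

The same computation accounts for the negative entries in Table 13: a negative average pairwise correlation stays negative after correction.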

Finish scores are reliable: Table 14 presents inter-judge

reliability for finish scores. Most correlations were in the .70's

and .80's after being corrected by Spearman-Brown for two judges.

There was only one very low correlation (the third lathe dimension--

finish of the chamfer), and this was low only for the judges of the

odd-numbered subjects. Overall, these inter-judge correlations may

be considered adequate, but not impressive in magnitude. It is the

more subjective elements involved in finish evaluations that can

explain why finish reliability is not extremely high.

Non-tradesmen can evaluate reliably: All evaluators in this

study were non-tradesmen, had no previous experience in using the

TABLE 14.--Inter-judge reliability for finish dimensions.*

Machine          Dimension   Judges 3 & 8 (odd)   Judges 5 & 9 (even)

Horizontal Mill  1           .62 (.77)            .66 (.79)

Vertical Mill    1           .74 (.85)            .58 (.74)
                 2           .71 (.83)            .72 (.84)

Drill Press      1           .84 (.91)            .81 (.90)
                 2           .51 (.68)            .86 (.80)

Lathe            1           .66 (.79)            .70 (.82)
                 2           .59 (.74)            .69 (.81)
                 3           .11 (.21)            .67 (.80)

Surface Grinder  1           .46 (.63)            .71 (.83)
                 2           .54 (.70)            .48 (.65)

*r's in parentheses are corrected by Spearman-Brown formula for two
judges.

instruments required for tolerance evaluation, and were previously un-

familiar with the machining process. Yet with only a moderate amount

of training on the tolerance measurements (one instructional period

lasting about 20 minutes for each one of the five tasks), and with

initial supervised practice on about five workpieces, evaluators were

able to score the performance test reliably--mostly in the .90's for

tolerance, somewhat lower for finish.

Tolerance evaluations are more reliable than finish evalua-

tions: Tolerance measurements were consistently evaluated more re-

liably than finish. Reliabilities for tolerance were mostly in the

.80's and .90's. For finish they were mostly less than .80. Table 15

presents the data which show that differences between tolerance and

finish reliabilities were highly significant on all machines but the

lathe.

Biases in evaluators' measurements are rare: Tables 16 and 17

present means and standard deviations of raw and scored measurements

for all dimensions. The dimensions which are in parentheses are the

ones which are relevant to testing whether certain evaluators have

measurement biases. The data (in Table 17) was first examined to see

if certain evaluators had consistent biases--too high or too low. Only

three instances of consistent biases were found: evaluator number 1

had the lowest mean measurement on three of the four micrometer dimen-

sions; evaluator number 4 had the highest mean measurement on three of

TABLE 15.--Comparison of inter-judge reliabilities for tolerance and
finish.a

                  Mean Tolerance   Mean Finish   p-value for difference
Task              Reliability      Reliability   between r's

Horizontal Mill   .93              .64           p < .00005
Vertical Mill     .95              .6875         p < .00003
Drill Press       .9467            .755          p < .0002
Lathe             .6616            .57           p < .25 (N.S.)
Surface Grinder   .96              .5475         p < .00001
Total             .8896            .64           p < .002

a Average N for tolerance was 33; average N for finish was 60.
Variance of the statistic was computed by the formula:

$$\mathrm{Var} = \frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}$$
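The comparison behind Table 15 can be sketched as follows: each mean reliability is converted by Fisher's r to z transformation, and the difference is divided by the square root of the variance given in the table note. Two points are assumptions rather than statements from the text: the finish N is taken as 60, and the p-value is computed as a one-tailed normal probability.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """z statistic and one-tailed p for the difference between two
    independent correlations, using Fisher's r to z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    variance = 1 / (n1 - 3) + 1 / (n2 - 3)
    z = (z1 - z2) / math.sqrt(variance)
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_tailed

# Horizontal mill row of Table 15: mean tolerance r = .93 (N = 33),
# mean finish r = .64 (N assumed to be 60).
z, p = compare_correlations(.93, 33, .64, 60)
print(round(z, 2), p)
```

Under these assumptions the horizontal mill difference gives a z of about 4, with a one-tailed probability below the tabled bound of .00005.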

TABLE 16a.--Means and standard deviations of raw measurements (odds).

                              Judge 1               Judge 6               Judge 7
Task             Dim.        Mean       S.D.       Mean       S.D.       Mean       S.D.

Horizontal Mill  1       9996.765    147.064  10008.235    145.733   9992.059    147.793
(N = 34)         2       9992.353    150.471   9997.059    145.539   9986.176    151.590
                 3       6233.588    241.044   6226.794    239.941   6243.588    244.869
                 4       6233.000    242.337   6206.059    241.510   6229.500    242.054
                 5       1287.294     65.759   1292.822     72.161   1281.059     66.068
                 6       1279.824     66.363   1281.676     66.930   1272.000     69.404

Vertical Mill    1       2435.647    291.507   2435.059    292.058   2465.118    341.365
(N = 34)         2       2436.088    288.757   2434.735    289.139   2462.618    338.861
                 3        999.882     11.654   1002.147      9.783    999.824     11.746
                 4        631.353     16.331    631.829     16.813    630.765     16.329
                 5        631.382     16.073    629.324   25.637**    632.353     16.562
                 6        632.735     16.497    633.176     19.196    634.088     19.318
                 7       1500.676     10.991   1497.794     11.687   1499.971     10.923

Drill Press      1       7603.059     84.140   7592.235     88.965   7594.824      86.16
(N = 34)         2      17499.412    242.307  17496.000    243.100  17498.412    242.816
                 3      17538.206    250.421  17532.765    250.620  17540.029    248.081
                 4       9987.706    235.356   9965.059    236.767   9981.912    225.382
                 5      10006.824    231.139   9990.353    233.268  10002.794    225.249
                 6         97.382      5.151     96.882      5.155     97.294      5.119

Lathe            1      19975.094     74.429  19960.594     70.655  19974.469     74.466
(N = 32)         2       8034.594    117.620   8013.437    124.831   8021.750    122.688
                 3       8040.844    114.440   8019.156    114.283   8027.781    122.348
                 4         46.656     14.177     65.594     15.801     64.625     19.021
                 5         47.750    19.444*     65.625     28.014     66.656     27.645

Surface Grinder  1       7193.059     11.735   7192.559     12.446   7192.618     11.649
(N = 34)         2       7193.000   11.702**   7191.912     12.001   7193.235     11.995
                 3       7192.882   11.697**   7192.029     11.681   7192.412     11.259
                 4       7193.235     12.433   7191.500     12.188   7192.735     12.013

*p < .05
**p < .01

TABLE 16b.--Means and standard deviations of raw measurements (evens).

                          Judge 2                 Judge 4                 Judge 5
Task         Dimension    Mean        S.D.        Mean        S.D.        Mean        S.D.

Horizontal       1        10023.500    95.808     10022.250    96.695     10033.281    97.160
Mill             2        10025.844    96.355     10019.125    92.914     10028.281    97.473
(N = 32)         3         6268.406   378.403      6273.125   377.672      6258.375   379.511
                 4         6269.594   379.776      6272.750   377.569      6256.500   379.701
                 5         1457.687   513.807      1457.219   514.158      1454.187   515.331
                 6         1412.562   462.461      1413.812   462.089      1414.375   462.126

Vertical         1         2451.118   233.393      2447.000   232.978      2452.412   233.853
Mill             2         2454.235   236.302      2450.118   233.760      2454.882   233.798
(N = 32)         3         1006.412    27.612      1006.678    28.017      1006.382    28.072
                 4          632.794    18.902       631.029    18.379       631.029    19.268
                 5          632.176    15.392       630.912    14.958       630.647    15.206
                 6          634.382    14.625       633.118    14.614       634.735    19.048
                 7         1495.235    19.085      1495.588    19.113      1496.235    19.217

Drill            1         7704.529   392.551      7689.882   393.022      7701.176   392.490
Press            2        17535.029   166.651     17552.618   168.649     17527.529   162.987
(N = 34)         3        17562.735   165.622     17576.294   164.943     17563.706   164.405
                 4        10067.441   299.925     10077.912   306.188     10063.176   298.323
                 5        10095.882   292.665     10110.971   297.768     10094.324   300.948
                 6           97.735     4.147        98.059     3.629        97.735     4.266

Lathe            1        20009.545   228.157     20014.485   215.148     20008.242   219.314
(N = 33)         2         8036.879   168.234      8029.727   169.593      8035.848   168.820
                 3         8005.970   148.053      8022.909   142.520      8006.788   142.473
                 4           65.424    23.652        68.364    23.324        65.273    20.674
                 5           66.000    22.409        66.485    24.952        70.303    23.468

Surface          1         7191.853    13.855      7192.206    12.395      7191.676    11.846
Grinder          2         7191.618    13.844      7192.824    15.544      7191.324    12.084
(N = 34)         3         7188.059    30.086      7192.794    10.295      7189.588    17.119
                 4         7186.441    35.290      7192.353    11.430      7190.706    15.038

TABLE 17a.--Means and standard deviations of scored measurements (odds).a

                          Judge 1             Judge 6             Judge 7
Task         Dimension    Mean      S.D.      Mean      S.D.      Mean      S.D.

Horizontal      (1)        .853     .845       .853     .845       .882     .867
Mill            (2)        .882     .857       .882     .867       .853     .845
(N = 34)         3         .971     .785      1.000     .804       .794     .796
                 4        1.000     .840      1.000     .767       .971     .822
                 5        1.176     .890      1.118     .867      1.029     .857
                 6        1.175     .821      1.205     .867      1.088     .818

Vertical         1         .941     .838       .971     .923       .941     .906
Mill             2        1.029     .891       .941     .906       .941     .905
(N = 34)         3        1.118     .796      1.324     .794      1.118     .758
                 4        1.059     .998       .941     .998      1.118     .993
                 5        1.118     .993      1.000    1.000      1.000    1.000
                 6        1.118     .993      1.175     .984      1.118     .993
                (7)       1.382     .841      1.353     .836      1.265     .851

Drill            1         .882     .993      1.059     .998      1.000    1.000
Press            2         .794     .867       .755     .842       .735     .815
(N = 34)         3         .755     .769       .735     .779       .675     .755
                 4         .547     .723       .705     .787       .647     .762
                 5         .647     .800       .706     .787       .575     .794
                 6        1.529     .848      1.500     .849      1.529     .813

Lathe           (1)       1.281     .874      1.000     .865      1.155     .833
(N = 32)         2        1.031     .918      1.031     .883       .959     .883
                 3         .906     .879       .875     .781       .875     .857
                 4         .313     .725       .525     .927       .688     .950
                 5         .313     .725       .875     .992       .875     .992

Surface         (1)        .457*    .502       .543     .602       .571     .645
Grinder         (2)        .514***  .592       .500     .685       .629     .580
(N = 35)        (3)        .500     .800       .557     .715       .557     .754
                (4)        .543     .731       .557     .754       .543     .690

aDimensions in parentheses were used to test whether certain judges had
measurement biases.

*p < .10
**p < .05
***p < .025

TABLE 17b.--Means and standard deviations of scored measurements (evens).

                          Judge 2             Judge 4             Judge 5
Task         Dimension    Mean      S.D.      Mean      S.D.      Mean      S.D.

Horizontal      (1)       1.094     .843      1.125     .893      1.094     .843
Mill            (2)       1.094     .843      1.125     .857      1.031     .847
(N = 32)         3         .969     .809      1.094     .879       .969     .847
                 4         .875     .740       .969     .883       .969     .809
                 5        1.063     .899      1.094     .914      1.094     .879
                 6         .969     .847       .875     .857       .906     .879

Vertical         1        1.176     .856      1.206     .832      1.147     .912
Mill             2        1.147     .845      1.176     .821      1.147     .879
(N = 34)         3        1.000     .908      1.000     .907       .971     .923
                 4        1.059     .998      1.059     .998      1.059     .998
                 5        1.059     .998      1.059     .998      1.059     .998
                 6        1.118     .993       .941     .998      1.059     .998
                (7)       1.000     .939       .912     .887       .941     .906

Drill            1         .529     .882       .765     .972       .647     .936
Press            2         .941     .725       .971     .857      1.000     .767
(N = 34)         3         .794     .796       .735     .740       .824     .785
                 4         .647     .800       .618     .768       .618     .768
                 5         .676     .794       .618     .768       .618     .768
                 6        1.559     .735      1.676     .629      1.588     .771

Lathe           (1)        .879     .844       .909     .900       .758     .818
(N = 33)         2         .667     .765       .788     .844       .758     .740
                 3         .818     .796       .909     .900       .788     .807
                 4         .606     .919       .909     .996       .848     .988
                 5         .424     .818       .667     .943       .485     .857

Surface         (1)        .735***  .779       .912     .818       .853     .772
Grinder         (2)        .794     .795       .765     .807       .824     .821
(N = 34)        (3)        .853     .879       .882     .832      1.000     .840
                (4)        .824**   .856       .912     .818       .912     .887

*p < .10
**p < .05
***p < .025

the four caliper dimensions; evaluator number 2 had the lowest mean

measurement on three of the four micrometer dimensions. In order for

these biases to have any practical significance they not only need to

be consistent in a direction (high or low) for an evaluator, but also

must be of significant magnitude. Matched-pair t-tests were computed

on the data that had consistent biases. The biased evaluators' measurements

were paired with the measurements of the evaluator whose mean measurement

was in the middle of the other two measurements. None of evaluator

number four's biases were statistically significant; two of evaluator

number one's measurements were significant (surface grinder 1,

p < .10; surface grinder 2, p < .025); two of evaluator number two's

measurements were significant (surface grinder 1, p < .025; surface

grinder 4, p < .05). It appears from the results of this study that

evaluators' biases are not both consistent enough and high enough in

magnitude to warrant the conclusion that there are biases in certain

evaluators' measurements.
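The matched-pair comparison described above can be sketched as follows. This is a minimal illustration, not the study's actual computation: the function name `paired_t` and the readings are hypothetical, and only the t statistic and its degrees of freedom are computed (the p-value would be read from a t table).

```python
import math

def paired_t(biased, reference):
    """Return (t, df) for a matched-pair t-test on two equal-length samples.

    Each pair is one workpiece dimension measured by the evaluator suspected
    of bias and by the evaluator whose mean fell in the middle.
    """
    if len(biased) != len(reference) or len(biased) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(biased, reference)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the pairwise differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1

# Hypothetical readings for five workpieces (not data from the study).
suspect = [7192, 7190, 7195, 7191, 7189]
middle = [7193, 7192, 7195, 7193, 7190]
t_stat, df = paired_t(suspect, middle)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

A negative t here would indicate that the suspect evaluator reads lower than the reference evaluator on the same workpieces.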

Certain evaluators are not more sloppy: Comparison of standard

deviations of the evaluators on each measurement (Tables 16 and 17)

showed little sloppiness of the evaluators. F-tests were done on all

dimensions on variances of raw measurements and of scored measurements.

The tests on scored measurements produced no significant differences

between variances. Tests on raw measurements produced only a few.

Measurement number five on the vertical mill showed evaluator number

six to be more sloppy (p < .01). Measurement number five on the lathe

showed evaluator number 1 to be more sloppy (p < .05). Measurements

three and four on the surface grinder showed evaluator number 1 to be

more sloppy (p < .01). Examination of the standard deviations revealed

no trends which may have indicated that a particular evaluator was more

sloppy than the others.

Coefficient alphas by judges for the tolerance scores are found

in Table 18. Using Fisher's r to z transformations, the highest coefficient

within each task was compared with the lowest coefficient. No

significant differences were found, nor were any trends found by examination

of the data that would indicate one particular judge as being

sloppier than the other two.
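The comparison of a highest and a lowest coefficient through Fisher's r to z transformation can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the two coefficients are treated as if they came from independent samples, and the variance of the difference is taken as 1/(N1 - 3) + 1/(N2 - 3). The example values (.62 and .49 with N = 32) are the highest and lowest lathe coefficients reported for the even-numbered judges.

```python
import math

def fisher_z(r):
    """Fisher's r to z transformation (inverse hyperbolic tangent)."""
    return math.atanh(r)

def compare_correlations(r1, n1, r2, n2):
    """Return (z, two-sided p) for the difference between two coefficients.

    Assumes independent samples; the standard error of the difference in
    transformed values is sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 cases")
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

z_value, p_value = compare_correlations(0.62, 32, 0.49, 32)
print(f"z = {z_value:.2f}, two-sided p = {p_value:.2f}")
```

With samples of about 32 to 34 workpieces, a gap of this size falls well short of conventional significance, which is in line with the finding that no judge stood out.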

TABLE 18.--Coefficient alpha by judge for tolerance scores.

                  Horizontal   Vertical   Drill              Surface
         Judge    Mill         Mill       Press     Lathe    Grinder     N

Odds       1        .68          .76       .63       .23       .89
           6        .67          .74       .55       .46       .91      34
           7        .69          .72       .60       .41       .91

Evens      2        .86          .78       .76       .62       .91
           4        .87          .73       .72       .51       .91      32
           5        .82          .74       .76       .49       .93

Finally, examination of the inter-judge reliabilities

(Tables 9, 10, 11, and 12) reveals no trends that would indicate a

particular judge is sloppier than the two other judges with whom he

was compared. Because no judge was consistently sloppier, no significance

test on the reported correlations was carried out. The conclusion

must therefore be drawn that measurement errors are random and

equal between judges in the long run.

Measurements do not become more reliable over time: The data

presented in Table 19 does not indicate that measurements become more

reliable over time. The only possible exception to this is the two

measurements (3 and 4) on the horizontal mill, where evaluators had to

use sliding parallels to measure the width of the slot; -.27 is significant

at the .05 level and -.24 is significant at the .07 level

(both one-tailed tests).
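The trend analysis behind these correlations can be sketched as follows. This is a hedged illustration: it assumes that "time" means the order in which workpieces were evaluated, it takes the absolute difference between two evaluators' scored measurements, and the function names and scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the sequences has zero variance")
    return sxy / math.sqrt(sxx * syy)

def deviation_trend(scores_one, scores_two):
    """Correlate |evaluator one - evaluator two| with evaluation order.

    A negative value means the two evaluators disagree less as they go on.
    """
    abs_dev = [abs(a - b) for a, b in zip(scores_one, scores_two)]
    order = list(range(1, len(abs_dev) + 1))
    return pearson_r(order, abs_dev)

# Hypothetical scored measurements in the order they were evaluated.
evaluator_one = [2, 2, 1, 1, 1]
evaluator_two = [0, 1, 1, 1, 1]
trend = deviation_trend(evaluator_one, evaluator_two)
print(f"correlation of absolute deviation with order: {trend:.3f}")
```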

Internal Consistency Reliability

Internal consistency reliability is shown in Table 20. Within

each machine by tolerance and by finish, internal consistency is fairly

high. The only deviation from this is the reliability of the lathe

tolerance measurements. This finding can easily be explained by the

lack of inter-judge reliability on this machine, which has a direct

bearing on the extent of internal consistency reliability. Overall

TABLE 19.--Correlations of absolute deviations of scored measurementsa with time.b

              Horizontal   Vertical   Drill              Surface
Dimension     Mill         Mill       Press     Lathe    Grinder

    1           .16          .14       -.22      -.03      .12
    2          -.03          .11       -.09       .04      .17
    3          -.27**        .13       -.10       .06      .10
    4          -.24*         .25        .00       .09      .09
    5          -.08          .13       -.04      -.15
    6          -.15          .13       -.11
    7                       -.02

aAbsolute deviations of scored measurement = measurement of evaluator
one minus measurement of evaluator two.

bNegative values indicate less slop with time.
Positive values indicate more slop with time.

*p < .10
**p < .05

TABLE 20.--Internal-consistency reliability.

                                                Tolerance & Finish
Machine                 Tolerance    Finish     2 itema    Multi-itemb

Horizontal Mill           .76         .78        .44         .82
Vertical Mill             .76         .84        .49         .84
Drill Press               .65         .69        .55         .76
Lathe                     .45         .63        .49         .65
Surface Grinder           .91         .77        .27         .86

Total (by machine)        .56         .59
Total (by dimension)      .76         .66

Total--2 items: total tolerance; total finish                               .74
Total--5 items: (total tolerance + finish) X 5 machines                     .70
Total--10 items: (total tolerance) X 5; (total finish) X 5                  .73
Total--38 items: by tolerance dimensions (28); by finish dimensions (10)    .82

aTwo items are finish score and tolerance score.

bItems are each measurement and evaluation within tolerance and
finish.

reliability (of the total test) is also high. The actual internal

consistency reliability of the total test lies somewhere between .70,

which is the reliability of a five item test where each machine (toler-

ance plus finish) is an item, and .82, which is the reliability of a

38 item test where each tolerance measurement and each finish evaluation

are items. The value of .82 is an inflated estimate because of

correlated errors within each task. The two item test and the ten item

test have a smaller degree of correlated errors, and the five item test

has no correlated errors. The value of .70 is an underestimate because the

test was actually longer than five items.
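Because the estimate depends on what is treated as an item, it may help to see how coefficient alpha is computed from item scores. The sketch below uses the standard formula alpha = k/(k - 1) * (1 - sum of item variances / variance of totals); the function name and the scores are hypothetical, and the study's own computation may have differed in detail.

```python
from statistics import variance

def cronbach_alpha(items):
    """Coefficient alpha for a list of items.

    `items` holds one list of scores per item (for example, one list per
    machine), with each list ordered by the same subjects.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n_subjects = len(items[0])
    if any(len(item) != n_subjects for item in items):
        raise ValueError("every item needs a score for every subject")
    # Total score per subject, summed across items.
    totals = [sum(item[i] for item in items) for i in range(n_subjects)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical scores: three items (rows) for four subjects (columns).
example_items = [
    [2, 4, 3, 5],
    [1, 3, 3, 4],
    [2, 5, 2, 5],
]
print(f"alpha = {cronbach_alpha(example_items):.2f}")
```

Regrouping the same scores into fewer, broader items (machine totals rather than single dimensions) changes both k and the variances, which is why the five item and 38 item figures differ.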

Correlation Matrix

Data in the correlation matrix in Table 21 can be used to

assess the relationship of performance on one machine to the next and

relationship of tolerance to finish. The overall correlation between

tolerance and finish was .59. Correlations between machines on toler-

ance and on finish are very similar. The surface grinder had the

smallest relationship with the other machines. This finding may change

if a surface grinder were to be used that was in better condition than

the one used in this study.

TABLE 21.--Intercorrelations of scores within the performance test.

[The correlation matrix did not survive scanning in a usable layout; only the list of variables is recoverable.]

Performance Test: 1) Total tolerance plus finish score; 2) Total tolerance score; 3) Total finish score

Tolerance Plus Finish Scores for: 1) Horizontal Mill; 2) Vertical Mill; 3) Drill Press; 4) Lathe; 5) Surface Grinder

Tolerance Scores for: 1) Horizontal Mill; 2) Vertical Mill; 3) Drill Press; 4) Lathe; 5) Surface Grinder

Finish Scores for: 1) Horizontal Mill; 2) Vertical Mill; 3) Drill Press; 4) Lathe; 5) Surface Grinder

Test-Retest Reliability

Stability across time, as can be seen in Table 22, is virtually

nonexistent. No correlations between time one and time two were very

high, and the overall picture is one of no test-retest reliability for

the time interval in the present study. However, as pointed out earlier,

this time interval was far too long to draw any conclusions about

whether or not this performance test does in fact have test-retest

reliability.

TABLE 22a.--Test-retest reliability by machine.

Machine              Tolerance    Finish

Horizontal Mill        -.04         .34
Vertical Mill           .47        -.10
Drill Press             .41         .00
Lathe                  -.05         .16
Surface Grinder         .23         .15
Total                  -.07         .33

TABLE 22b.--Test-retest reliability by dimension.

Dimension            Tolerance    Finish

[Most of this table did not survive scanning. Values printed under Horizontal Mill, column assignment unclear: .32, -.40, .26, .07, .00. No values survive for the Vertical Mill, Drill Press, or Surface Grinder rows.]

DISCUSSION

Evaluating the Evaluators

The results of this study bear out the hypotheses that non-

tradesmen can reliably evaluate (measure and judge for finish) the end

products of a test for machinists in the metal trades. The fact that

the questions concerning sloppiness and biases on the part of evaluators

were not affirmed by the data lends further support to the use

of non-tradesmen in evaluating this kind of performance test. Those

measurements which did not have inter-judge reliabilities as high as

most of the others, and the measurers who did not have reliabilities

as high as the others, point out the need for training of and feedback

to the evaluators. The fact that the measurers of the odd-numbered

workpieces did not have as high agreement as those of the even-numbered

workpieces could possibly be explained by differences in the workpieces

themselves--the odd-numbered ones just by chance were harder to measure.

But another likely explanation is that these evaluators were not suffi-

ciently trained. The process of training involves showing the evalu-

ators how to use and read the instruments, showing what part of the

workpiece to measure, and placing a psychological emphasis on such

factors as being careful or meticulous in making measurements and measuring

in a standard or consistent manner. Only a small amount of

effort was put into the process of training the evaluators to evaluate

as accurately as possible. Had more of an effort been put into training

the evaluators, inter-judge reliability may have been higher, and

equally high for all measurers and measurements. However, there was no

way of empirically showing from the data in this study that an insuffi-

cient training effort was a major cause of unreliability in the mea-

surements.

The data, in general, did not support the question of evalu-

ators becoming more reliable over time. The probable explanation for

this is also related to training. Evans (1951) reported that feedback

to evaluators about the accuracy of their measurements is essential if

inter-judge agreement is to remain high. Evans found that raters improve

in sets of measurements in immediate succession (e.g., set 1 has

the least accurate measurements; set 2 has intermediate accuracy; set 3

has the most accurate measurements). Feedback came immediately after

each set. However, when there was a long interval (more than one hour)

between sets of measurements accuracy went down, and when there was a

very long interval (10 days) accuracy was worse than in the beginning

(immediately following an initial 30 minute training period). In light

of Evans' findings, it is not surprising that evaluators did not become

more reliable over time. Little or no feedback was given to evaluators,

and measurements were not all done in immediate succession. Any practice

effect that may have been operating was probably offset by these

two factors, resulting in no negative correlations of absolute deviations

of evaluators with time.

Most of the same factors that operate on tolerance evaluations

also operate on the finish judgments, with one important addition being

an element of subjectivity. Finish evaluations were not measured, but

instead were subjective judgments on how smooth the workpiece was. The

process was made less subjective by introducing the system of bench-

marks, which enabled evaluators to make relative, rather than absolute,

judgments. The evaluators' major difficulties, and probably the major

source of disagreement, came when a workpiece fit between two cate-

gories. One way to make finish evaluations more reliable would be to

have more benchmarks within each category. The evaluator would thus be

more likely to find a benchmark that would be closer in finish to the

workpiece he or she is evaluating. One final explanation of why finish

reliability was not as high as tolerance reliability is that finish was

evaluated long after tolerance, allowing some workpieces to become

rusty.

The results of this study pertaining to inter-judge reliability

are similar to those reported in the literature. Inter-judge relia-

bility in this study appears to be about the same as that found by

Bornstein et al. (1957) as shown in Table 9. Siegel (1954a) and

Stuit (1947) report inter-judge reliabilities in the .90's. While most

of the reliability coefficients in the present study reached the .90's,

there were many exceptions to this. Kelly (1955) reported on subjective

kinds of judgments, very similar to the finish scores in this

study. However, she used a method of paired-comparisons of 10 glass

panels which resulted in rank-orderings of the panels. This was re-

peated twice for each judge and rankings within each observer were

correlated. Therefore, there is no way of directly comparing Kelly's

results with the present study. Kelly's conclusion was that these

judgments can be made reliably, a conclusion which is in agreement with

the results of this study.

Evaluating the Test

Coefficient alphas for tolerances by machine and for finish by

machine were at least adequate, with the exception of the lathe. The

low internal consistency reliability of the lathe was explained in the

results section by the fact that inter-judge reliability was low. The

surface grinder had the highest tolerance reliability. This would be

expected because the surface grinder task was by far the least complex,

and also had the greatest degree of correlated errors. No layout was

necessary on this task and only one grinding process was called for.

Very high coefficient alphas were not expected, nor were they desirable,

across tolerance and finish or across machines. Had alphas been high,

it would have been an indication that the performance test was only

tapping one dimension. Tolerance and finish, and ability on each

machine were expected to be related to some extent but not extremely

highly related. So alphas were not expected to be very high. As can

be seen in Table 21, correlations between machines (tolerance plus

finish) and correlations between tolerance and finish were moderately

positive (significant in most cases), but not extremely high. The

pattern of correlation did not differ much from those reported by

Campion (1972), reported in Table 1, or by Siegel and Jensen (1955),

reported in Table 5, but were generally higher than those reported by

Bornstein et al. (1957), reported in Table 2.

Overall reliability of the performance test (total including

tolerance and finish on all five machines) is estimated to be about

.76. It would be somewhere between alpha of a five item test (.70)

where total score on each machine is an item, and a 38 item test (.82)

where all tolerance and finish dimensions are items. The former is

an underestimate because the test is in fact longer than five items,

and the latter is an overestimate because the errors made in many of

the 38 items are correlated with errors made in other items. Alpha of

the performance test used in this study is slightly higher than the

alphas calculated from the data in the studies of Campion (1972),

Bornstein et al. (1957), and Siegel and Johnson (1955), which were


.40, .62, and .59, respectively. Here, the appropriate comparison

would be with .70, the alpha of a five item test.

Evaluating the Test (The

Second Time Around)

No evidence of test-retest reliability was found in this study.

Low test-retest reliability can often be explained by restriction in

range. In this study, however, the lack of stability cannot be explained

by a restriction in range problem. The 21 retest subjects did

not differ in variance from the original 68 subjects on the first

testing, nor was the variance restricted on the second testing, which

may have been expected as a result of practice that would create a

ceiling effect. Due to constraints placed on the researchers by the

participating company, and also due to practical considerations by the

researchers, the time period between test and retest was far too long

for all testees. The range was from 7 to 20 weeks, averaging about

14 weeks. The ideal period between test administrations should have

been one to two weeks. Any longer period of time would result in

differential learning and practice on the different machines. If a

person has not had any practice on a machine for three or four months,

then he would undoubtedly be rusty on that machine. If another person

had had many hours of experience on a particular machine, then his

score would be expected to improve. The training program for the


apprentices in the retest company lacks any kind of uniformity across

apprentices, so some of the subjects had no experience, while others

had experience on one or more of the machines in the interim. (This

lack of uniformity was confirmed through examination of machine time

logs of the various retest subjects for the interim period.) With

these conditions, this performance test cannot be expected to have

high stability over long periods of time.

Conclusions

The performance test in this study was one constructed to be

reliable across judges, to be internally consistent, and to be stable

over time. The data showed the test to be reliable across judges and

to be internally consistent, but not stable over the extreme time in-

terval that was employed. Further research needs to be conducted to

examine retest reliability. The interim period should be from one to

two weeks if one is to expect a fair degree of stability. A larger

sample would also be a desirable component of any further research on

retest reliability. All subjects who were tested the first time,

rather than only volunteers, should be tested the second time.

While inter-judge agreement was high in almost all respects,

careful training of evaluators and a feedback system should be neces-

sary elements in the use of performance tests. More extensive training


is especially important where judgments are somewhat subjective in

nature. (In this study it was the benchmark system for evaluating

finish.)

The generalization of results of this study to performance

testing in occupations other than the metal trades is speculative and

in need of further research. However, it appears that a test where

end products are what are being evaluated can be scored reliably and

can be constructed to yield adequate reliability.

A number of implications can be drawn from this study which

pertain to the use of performance testing by government, industry, or

vocational counseling. First, this study takes a thorough look at

reliability issues involved in the construction of a performance test--

more thorough than past studies found in the literature. Future re-

search should contain in it all elements of reliabilities that are con-

tained in this study. The results of this study lend support to the

feasibility of using performance tests--non-tradesmen were able to score

the test reliably at their own convenience. The use of performance

testing by government in occupational licensing or in selection and

placement of job applicants can only be justified if such performance

tests are reliable. This study lends support to use in these areas--

this kind of test can be constructed to be internally consistent and

to be scored reliably. Use of this kind of test by governmental

agencies may result in improvements in their own functions and


effectiveness. Industry can use such a test to evaluate training pro-

grams or to evaluate individual progress (provided that retest relia-

bility is established). The fact that reliable performance tests can

be constructed lends support to the performance testing movement, which

argues that tests which sample job skills are often more valid and

fairer to minorities than traditional paper-and-pencil tests.

APPENDICES

APPENDIX A

TRANSCRIPTS OF TAPED INSTRUCTIONS AND

BLUEPRINTS OF TASKS


Lathe Instructions

These are the instructions for the Lathe Task 2. Please listen

to these instructions in their entirety before starting this task. The

instructions are as follows:

Examine the workpiece drawing. It will be labeled "Lathe

Task 2." You are to face the workpiece to the length cited. You are

also to bore it out to the dimension shown. And finally, you are to

chamfer one end of the workpiece both internally and externally as in-

dicated. Do your best to stay within the tolerances shown.

Your performance in this task will be evaluated on two dimen-

sions: quality and speed. You should therefore work as quickly as you

feel is possible to turn out high quality work. You will have only one

workpiece.

When you finish, place your identification sticker on the work-

piece and place the workpiece in the bin provided. Press the sticker

on carefully, since it sometimes does not stick well.

Next, clean up the machine, leaving it in the condition you

found it for the next person.

This completes the instructions. If you need to, you may re-

wind the tape and listen to any portion of the instructions again.

When you are completely finished, rewind the tape and turn off

the tape recorder. Then report to the administrator for assignment to

another machine.

This is the end of the tape.

HORIZONTAL AND VERTICAL MILL TASKS

[Blueprint. Drawing not recoverable from the scan; legible callouts include 1.5" ± .005, 1" ± .005, and .750" +.000/-.002.]

[Blueprint. Drawing not recoverable from the scan; legible callouts include 82° CSK, .750 DIA. HOLE, and 1.750" ± .005.]

[Blueprint. Nothing legible survives from the scan.]

APPENDIX B

TOLERANCE EVALUATION SHEETS,

FINISH EVALUATIONS, AND

EXPLANATION OF DIMENSION NUMBERS

Testee

Evaluator

Horizontal Mill Task

Tolerance Evaluations

Instructions: Using the appropriate instruments, measure and record

each of the workpiece dimensions specified on this sheet. Then place

a circle around the most stringent of the below listed tolerances which

that dimension meets.

Location of Slot measured at end "L" (use calipers)

1" ± .002     1" ± .004     Neither

Location of Slot measured at end "R" (use calipers)

1" ± .002     1" ± .004     Neither

Width of Slot measured at end "L" (use sliding parallels)

(Do not rest sliding parallels on bottom of slot)

.625" +.002/-.000     .625" +.004/-.002     Neither

Width of Slot measured at end "R" (use sliding parallels)

.625" +.002/-.000     .625" +.004/-.002     Neither

Thickness of unmilled portion of workpiece at end "L" (use

micrometer) (A)

Thickness of milled portion of workpiece at end "L" (use

micrometer) (B)

Depth of Slot at end "L" (A-B)

.125" ± .002     .125" ± .004     Neither

Thickness of unmilled portion of workpiece at end "R" (use

micrometer) (C)

Thickness of milled portion of workpiece at end "R" (use

micrometer) (D)

Depth of Slot at end "R" (C-D)

.125" ± .002     .125" ± .004     Neither

Is the Slot in the correct location? Yes No

Was the cut started at another location on the workpiece? Yes No

Was a deeper or wider cut started but not completed? No Deeper Wider
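The circling rule on the sheet above (pick the most stringent listed tolerance that the measurement meets) can be sketched as below. This is an illustration only: the function name is hypothetical, dimensions are assumed to be recorded as whole ten-thousandths of an inch to avoid floating-point rounding at the limits, and the labels returned are not the numeric scores used in the study, which this sheet does not state.

```python
def classify_tolerance(measured, nominal, tight, loose):
    """Return 'tight', 'loose', or 'neither' for one measured dimension.

    All values are integers in ten-thousandths of an inch. `tight` and
    `loose` are (minus, plus) allowances around the nominal size, so an
    asymmetric tolerance such as +.002/-.000 is written (0, 20).
    """
    def within(allowance):
        minus, plus = allowance
        return nominal - minus <= measured <= nominal + plus

    if within(tight):
        return "tight"
    if within(loose):
        return "loose"
    return "neither"

# Slot width on the sheet: .625" +.002/-.000 (tight), +.004/-.002 (loose).
SLOT_NOMINAL = 6250
SLOT_TIGHT = (0, 20)
SLOT_LOOSE = (20, 40)

for reading in (6260, 6245, 6300):  # hypothetical readings
    label = classify_tolerance(reading, SLOT_NOMINAL, SLOT_TIGHT, SLOT_LOOSE)
    print(reading, label)
```

Checking the tighter band first is what makes the result "the most stringent" tolerance met, since any reading inside the tight band is also inside the loose one.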


Testee

Evaluator

Vertical Mill Task--Milling a Pocket

Thickness of unmilled workpiece near left end of pocket (use

micrometer) (A)

Thickness of stock at left end of pocket (use micrometer) (B)

Depth of pocket at left end (A-B)

.250" +.002/-.000     .250" +.004/-.000     Neither

Thickness of unmilled workpiece near right end of pocket

(use micrometer) (C)

Thickness of stock at right end of pocket (use micrometer) (D)

Depth of pocket at right end (C-D)

.250" +.002/-.000     .250" +.004/-.000     Neither

Length of pocket (use calipers)

1" ± .005     1" ± .010     Neither

For width of pocket take width of unmilled stock minus width of

milled part (use calipers).

Width of pocket at end "L"

.625 ± .005     .625 ± .010     Neither

Width of pocket at middle

.625 ± .005     .625 ± .010     Neither

Width of pocket at end "R"

.625 ± .005     .625 ± .010     Neither

Location of pocket from end "L" (use calipers)

1.5" ± .005     1.5" ± .010     Neither

Radius of cutter that was used (use shank of 3/8" end mill cutter)

3/16"     Something else

Is the pocket in the correct location? Yes No

Was an incorrect cut started at another location on the workpiece?

Yes No

Was a wider or longer cut started in the pocket but not completed?

Yes No


Testee

Evaluator

Drill Press Task






Hole Diameter (use telescoping gauge and "mike") (A)

.750" +.010/-.001     Meets a Less Stringent Tolerance

Distance from edge of hole to end "L" as measured from top of workpiece

(use calipers, inserting them from the countersink side of the

workpiece) (B)

Distance from edge of hole to end "L" as measured from bottom of

workpiece (use calipers, inserting them from the side of the workpiece

that has not been countersunk) (C)

Distance from center of hole to end "L" as measured from top of

workpiece (B + 1/2 A)

1.750" ± .005     1.750" ± .010     Neither

Distance from center of hole to end "L" as measured from bottom of

workpiece (C + 1/2 A)

1.750" ± .005     1.750" ± .010     Neither

Distance from edge of hole to end "C" as measured from top of workpiece

(D)

Distance from edge of hole to end "C" as measured from bottom of

workpiece (E)

Distance from center of hole to end "C" as measured from top of

workpiece (D + 1/2 A)

1" ± .005     1" ± .010     Neither

Distance from center of hole to end "C" as measured from bottom of

workpiece (E + 1/2 A)

1" ± .005     1" ± .010     Neither

Countersink Dimension (convert to decimal)

1" ± 1/32     Too Narrow     Too Deep

Was an 82° Countersink used? (Insert 82° Countersink in hole) Yes No
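The derived dimensions on this sheet add half the hole diameter to an edge distance. A minimal sketch of that arithmetic follows; the function name and readings are hypothetical, and values are assumed to be in ten-thousandths of an inch.

```python
def center_distance(edge_distance, hole_diameter):
    """Distance from a reference end to the hole center.

    Implements the sheet's rule "edge distance + 1/2 A", where A is the
    measured hole diameter. Units are whatever the inputs use.
    """
    if edge_distance < 0 or hole_diameter <= 0:
        raise ValueError("distances must be positive")
    return edge_distance + hole_diameter / 2

# Hypothetical readings in ten-thousandths of an inch.
hole_a = 7500    # measured hole diameter (A), i.e. .7500"
edge_b = 13750   # edge of hole to end "L" from the top (B), i.e. 1.3750"

center = center_distance(edge_b, hole_a)
deviation = center - 17500  # nominal center location is 1.750"
print(f"center at {center / 10000:.4f} in., deviation {deviation / 10000:+.4f} in.")
```

Because the center location depends on two readings, an error in the diameter measurement carries into every derived distance at half its size.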


Testee

Evaluator

Lathe Task II--Boring, Facing, and Chamfering






Length of Workpiece: . Avg. .

(Using the calipers, make several measurements, rotating the workpiece

approximately 120° after each, and record them above. Then compute the

average of these readings and record it. Use this average reading for

purposes of specifying the most stringent of the below listed length

tolerances which the workpiece meets. When making the individual

caliper measurements be sure the calipers are placed close to the edge

of the workpiece and away from the center axis point.)

2" ± .001    2" ± .002    Neither
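The length procedure above averages several caliper readings and then judges the average, not the individual readings, against the listed tolerances. A minimal Python sketch of that procedure follows; the function names and the three sample readings are hypothetical, and three readings are assumed because the workpiece is rotated about 120° between measurements.

```python
from statistics import mean


def average_length(readings):
    """Average of the individual caliper readings taken around the workpiece."""
    return mean(readings)


def most_stringent_tolerance(average, nominal, tolerances):
    """Return the tightest symmetric tolerance the average meets, or None for "Neither"."""
    deviation = abs(average - nominal)
    for tolerance in sorted(tolerances):
        if deviation <= tolerance:
            return tolerance
    return None


# Hypothetical readings taken roughly 120 degrees apart.
readings = [2.0004, 2.0008, 2.0012]
circled = most_stringent_tolerance(average_length(readings), 2.0, [0.001, 0.002])
```

Here the average is about 2.0008", which is within ± .001, so the first tolerance would be circled.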

Diameter of Bored Hole at end "L" (use telescoping gauge)

.800 +.001/-.000    .800 +.002/-.001    Neither

Diameter of Bored Hole at end "R" (use telescoping gauge)

.800 +.001/-.000    .800 +.002/-.001    Neither

Inside Chamfer (use scale)

1/16" ± 1/64    Meets a Less Stringent Tolerance

Outside Chamfer (use scale) . (convert to decimal)

1/16" ± 1/64    Meets a Less Stringent Tolerance

Are the inside and outside chamfers on the same end of the workpiece?

Yes No


Testee

Evaluator

Surface Grinder Task


Instructions: Using a dial micrometer, measure and record each of the

workpiece dimensions specified on this sheet. Then place a circle

around the most stringent of the below listed tolerances which that

dimension meets.

Thickness of Workpiece as measured at Corner "A"

.719" ± .0002    .719" ± .0004    Neither

Thickness of Workpiece as measured at Corner "B"

.719" ± .0002    .719" ± .0004    Neither

Thickness of Workpiece as measured at Corner "C"

.719" ± .0002    .719" ± .0004    Neither

Thickness of Workpiece as measured at Corner "D"

.719" ± .0002    .719" ± .0004    Neither

Were both the top and bottom sides of the workpiece ground? Yes No

If not, what was ground?

Note: In making these measurements be sure to place the micrometer in

far enough to avoid burrs on the edges of the workpiece.

Testee

Evaluator

Finish Evaluations

Instructions: For each finish evaluation, place the numbered workpieces to be used as benchmarks on the table in front of you. Be sure to include all of the numbered workpieces in each benchmark category. Identify the benchmark category most closely represented by the workpiece you are evaluating. Write the number of this benchmark category in the space provided.

HORIZONTAL AND VERTICAL MILL TASKS

Finish of Pocket Floor:

Benchmark      Benchmark      Benchmark      Benchmark      Benchmark
Category #1    Category #2    Category #3    Category #4    Category #5
w.p. #48       w.p. #45       w.p. #46       w.p. #42       w.p. #39

Finish of Sides of Pocket:

Benchmark      Benchmark      Benchmark      Benchmark      Benchmark
Category #1    Category #2    Category #3    Category #4    Category #5
w.p. #26       w.p. #46       w.p. #41       w.p. #49       w.p. #32

Finish of Slot Floor:

Benchmark      Benchmark      Benchmark
Category #1    Category #2    Category #3
w.p. #39       w.p. #28       w.p. #34
w.p. #48       w.p. #36       w.p. #35

Testee

Evaluator

Finish Evaluations

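Instructions: For each finish evaluation, place the numbered workpieces to be used as benchmarks on the table in front of you. Be sure to include all of the numbered workpieces in each benchmark category. Identify the benchmark category most closely represented by the workpiece you are evaluating. Write the number of this benchmark category in the space provided.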


DRILL PRESS TASK

Countersink Finish:

Benchmark Benchmark Benchmark Benchmark

Category #1 Category #2 Category #3 Category #4

w.p. #28 w.p. #34 w.p. #42 w.p. #36

w.p. #44 w.p. #72 w.p. #54 w.p. #49

w.p. #47 w.p. #62

w.p. #51 w.p. #63

Finish of Hole: (Note: If a ridge or line appears on the hole wall,

indicate the next lowest benchmark category.)


Category #1 Category #2 Category #3

w.p. #36 w.p. #58 w.p. #42

w.p. #70 w.p. #60


Testee

Evaluator

Finish Evaluations







LATHE TASK

Finish of bored hole: (Note: This is evaluated by both touch and

sight.)


Category #1 Category #2 Category #3

w.p. #11 w.p. #58 w.p. #10

w.p. #12

Finish of nonchamfered end:


Category #1 Category #2 Category #3 Category #4

end not faced w.p. #3 w.p. #58 w.p. #9

w.p. #11 w.p. #59 w.p. #56

Finish of chamfer:


Category #1 Category #2 Category #3

w.p. #11 w.p. #3 w.p. #7

w.p. #12 w.p. #8


Testee

Evaluator

Finish Evaluations







SURFACE GRINDER TASK

Finish of nonlabeled side: (Note: This is evaluated by running your

fingernail widthwise across the workpiece.)


Category #1 Category #2 Category #3 Category #4

w.p. #29 w.p. #28 w.p. #48 w.p. #35

w.p. #37

Chatter on nonlabeled side: (Note: This is evaluated by sight. Tilt the workpiece so that it reflects light. Look for the extent to which wavy lines appear.)


Category #1 Category #2 Category #3

w.p. #29 w.p. #43 w.p. #37

w.p. #53

Explanation of Dimension Numbers

Task               Explanation

Horizontal Mill    Location of slot from edge of workpiece to edge of slot at end "L," using calipers to make measurement
Horizontal Mill    Location of slot at end "R"
Horizontal Mill    Width of slot at end "L," measured by inserting sliding parallels in the slot and measuring the sliding parallels with the micrometer
Horizontal Mill    Width of slot at end "R"
Horizontal Mill    Depth of slot at end "L," using micrometer
Horizontal Mill    Depth of slot at end "R"
Vertical Mill      Depth of pocket at left end, using micrometer
Vertical Mill      Depth of pocket at right end
Vertical Mill      Length of pocket, using inside part of calipers
Vertical Mill      Width of pocket at end "L," using calipers
Vertical Mill      Width of pocket at middle
Vertical Mill      Width of pocket at end "R"
Vertical Mill      Location of pocket (edge of pocket from side of workpiece) at end "L"
Drill Press        Hole diameter, using telescoping gauge and measuring it with a micrometer
Drill Press        Distance from center of hole to end "L" (side of workpiece), using calipers on top of workpiece (countersunk side)
Drill Press        Distance from center of hole to end "L," measured from bottom of workpiece
Drill Press        Distance from center of hole to end "C," using calipers on top of workpiece
Drill Press        Distance from center of hole to end "C," measured from bottom of workpiece
Drill Press        Countersink dimension (diameter), measured with a scale
Lathe              Length of workpiece (average of three measurements), using calipers
Lathe              Diameter of hole at end "L," using telescoping gauge and micrometer
Lathe              Diameter of hole at end "R"
Lathe              Inside chamfer, measured with a scale
Lathe              Outside chamfer
Surface Grinder    Thickness of workpiece, measured as close to corner "A" as possible without overlapping the corner, in order to avoid measuring burrs
Surface Grinder    Thickness of workpiece at corner "B"
Surface Grinder    Thickness of workpiece at corner "C"
Surface Grinder    Thickness of workpiece at corner "D"

The letters in quotation marks ("L," "R," "C," "A," "B," "D") refer to labels placed on the workpieces to standardize measurements (in the case of the surface grinder) or to make the measuring process easier and to avoid errors from measuring the wrong parts of the workpieces.

APPENDIX C

EXPLANATION OF SCORING SYSTEM

AND TWO POINT SYSTEM

EXPLANATION OF SCORING SYSTEM

Tolerances were decided on with the help of the project's machinist
consultant. As can be seen on the evaluation sheets in Appendix B,
there were two tolerances for most measurements: the first tolerance,
which is identical to that shown on the blueprints, and a second
tolerance, which is not as stringent as the first. Dimensions were to
be scored 2, 1, or 0, depending on whether the testee's measurement
fell within the first tolerance, within the second tolerance, or
outside the second. It was realized before subjects were tested that
the distributions within each preset tolerance might not be ideal, and
that a revision of the tolerances based on the distributions of the
real measurements would be necessary. These revisions were made,
taking both the original tolerances and the distributions into
consideration. The resulting scoring system appears on the following
pages. Each table gives the number of measurements scored for a
particular dimension, the dimension (corresponding to the dimension on
the evaluation sheets), and the tolerances, each listed under the
number of points assigned for it. The approximate percentage of
testees falling into each tolerance category is shown in parentheses.
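The tiered scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used in the study: the function name is hypothetical, it assumes symmetric (±) tolerances, and it does not handle asymmetric tolerances such as a plus-only allowance. The surface grinder thickness tiers (.719 with ± .0002, ± .0004, and ± .0006) are taken from the scoring table; the sample measurements are invented.

```python
def score_dimension(measured, nominal, tiers):
    """Score one measured dimension against tiered tolerances.

    tiers lists symmetric tolerances from most to least stringent.
    The tightest tier earns len(tiers) points, the next one point
    fewer, and a reading outside every tier earns 0.
    """
    deviation = abs(measured - nominal)
    for index, tolerance in enumerate(tiers):
        if deviation <= tolerance:
            return len(tiers) - index
    return 0


# Surface grinder thickness uses three tiers (3, 2, 1, or 0 points).
thickness_tiers = [0.0002, 0.0004, 0.0006]
scores = [score_dimension(m, 0.719, thickness_tiers)
          for m in (0.7191, 0.7193, 0.7195, 0.7200)]
```

With two tiers the same function yields the 2, 1, or 0 scoring described in the text; for the four invented thickness readings above it yields 3, 2, 1, and 0 points respectively.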

In the analyses reported in this study, the yes-no questions (found
at the end of each evaluation sheet in Appendix B) were not scored.
There was little or no variance in these items: very few testees used
the wrong tools or started machining on the wrong part of the
workpiece. Furthermore, no weighting system could be decided on for
these items. Because they added almost nothing to the test, they were
dropped from the analysis.

#_of

meas.

NFPP—

2POINT

SYSTEM

Horizontal

Mill

dimension

2points

location

1.000

+1

.002

(34.8)

.002

.000

.002

(33.3)

width

.625

+1

(25.8)

+1

depth

.125

correct

location

another

cut

deeper

orwider

Vertical

Mill

.002

..000

length

1"

.005

(39.7)

width

.625

location

1.5

.005

(48.5)

radius

3/16"

correct

location

depth

.250

+1

(29.4)

+1 +1

incorrect

cut

started

wider

or

longer

cut

no

1point

+1 +1 +1 +1 +1 +1 +1

.008

(17.9)

.006

(40.8)

.004

(33.4)

yes

no

110

.005

.002

.011

(30.9)

.008

(45%?)

.010

(14.7)

(32.4)

yes

110

0points

other

(47.3)

(33.4)

(33.3)

yes

yes

(38.2)

(29.4)

(55%?)

(26.8)

no

yes

yes


#of

meas.

NNF—l—

dimension

diameter

distance

to

"L"

distance

to

"C"

c/s

82°

length

diameter

chamfer

same

end

2POINT

SYSTEM

(cont.)

Drill

Press

2points

.010

.001

1.750

.005

(25)

1.000

.005

(20.6)

1.000

.032

+1

.750

(60.3)

+1 +1

Lathe

.001

(46.1)

000]

.000

.06251:.0157

(66.1)

2”

+1

.8"

+1

(38.5)

‘IID

1point

+1

.015

(25)

.015

(20)

-.064

+1 +1

.003

(14.4)

.004

.003

+1

(23.0)

yes

0points

other

(39.7)

(50.0)

(59.4)

too

narrowor

deep

no

(28.5)

(29.5)

other

(23.9)

110


2 POINT SYSTEM (cont.)

Surface Grinder

# of
meas.   dimension                3 points          2 points          1 point           0 points

4       thickness .719           ± .0002 (14.7)    ± .0004 (16.2)    ± .0006 (23.5)    (45.6)
1       top and bottom ground    yes / no

