ABSTRACT
RELIABILITY OF A PERFORMANCE TEST FOR THE METAL TRADES
By
Alan L. Greenthal
A study was carried out to examine the reliability of a
performance test constructed for the metal trades. End products were
evaluated by non-tradesmen, using precision instruments for tolerance
measurements and a system of benchmarks for quality of finish of the
end products. Interjudge agreement was high. The five tasks on the
test yielded high internal consistency reliability. Retesting the
subjects after a period of 7 to 20 weeks did not show stability of
scores. This lack of test-retest reliability was not unanticipated.
The interim period was too long to assess test-retest reliability.
The optimal time period would have been one to two weeks, a period
of time which would not have allowed for the differential training and
practice that the testees went through in the longer time period. The
use of performance tests by government and industry in occupational
licensing and in job selection is discussed.
RELIABILITY OF A PERFORMANCE TEST FOR THE METAL TRADES
By
Alan L. Greenthal
A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
MASTER OF ARTS
Department of Psychology
1975
ACKNOWLEDGMENTS
This study was made possible through a grant funded by the
U.S. Department of Labor. Dr. Frank L. Schmidt was the principal
investigator. Mr. Milt Murto, Mr. William Main, Mr. Duane Ebbert,
Mr. William Neggly, and Mr. Walter Okrongley were the representatives
of the participating companies involved with the project. Mr. Ralph
Vanderslice was the master machinist whose assistance was invaluable
to the project. Special thanks go to Dr. John Hunter for his help
with the data analysis. Dr. John Berner, Ms. Felicia Williams,
Ms. Susan Badertscher, Ms. Anna Toth, and Ms. Barbara Ralsky were
the evaluators of the end products. The help and guidance of
Dr. Frank L. Schmidt, as chairman of my master's thesis committee,
was invaluable in carrying out this study. I would also like to
thank Dr. Neal Schmitt and Dr. John Wakeley for serving on my
committee.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF APPENDICES
INTRODUCTION
PROCEDURE
RESULTS
   Inter-Judge Reliability
   Internal Consistency Reliability
   Correlation Matrix
   Test-Retest Reliability
DISCUSSION
   Evaluating the Evaluators
   Evaluating the Test
   Evaluating the Test (The Second Time Around)
   Conclusions
APPENDICES
REFERENCES
LIST OF TABLES
Table
1. Intercorrelations for work sample measures (Campion, 1972)
2. Interscorer reliability of subtests (Bornstein et al., 1957)
3. Test-retest reliability (Bornstein et al., 1957)
4. Subtest intercorrelations (Bornstein et al., 1957)
5. Intercorrelations among trouble-shooting subareas
   (Siegel and Jensen, 1955)
6. Inter-judge reliability
7. Internal consistency (coefficient alpha)
8. Contents of correlation matrix
9. Inter-judge reliability of tolerance scores--raw
   measurements (computed on first 68 subjects)
10. Inter-judge reliability of tolerance scores--scored
    measurements (computed on first 68 subjects)
11. Inter-judge reliability of tolerance scores by
    machine--raw measurements
12. Inter-judge reliability of tolerance scores by
    machine--scored measurements
13. Average inter-judge reliability of each tolerance
    dimension (corrected by Spearman-Brown formula for
    three judges)
14. Inter-judge reliability for finish dimensions
15. Comparison of inter-judge reliabilities for tolerance
    and finish
16a. Means and standard deviations of raw measurements (odds)
16b. Means and standard deviations of raw measurements (evens)
17a. Means and standard deviations of scored measurements (odds)
17b. Means and standard deviations of scored measurements (evens)
18. Coefficient alpha by judge for tolerance scores
19. Correlations of absolute deviations of scored
    measurements with time
20. Internal consistency reliability
21. Intercorrelations of scores within the performance test
22a. Test-retest reliability by machine
22b. Test-retest reliability by dimension
LIST OF APPENDICES
Appendix
A. TRANSCRIPTS OF TAPED INSTRUCTIONS AND
   BLUEPRINTS OF TASKS
   Transcripts of Taped Instructions
      Horizontal Mill Instructions
      Vertical Mill Instructions
      Drill Press Instructions
      Lathe Instructions
      Surface Grinder Instructions
   Blueprints of Tasks
      Horizontal and Vertical Mill Tasks
      Drill Press Task
      Lathe Task II
      Surface Grinder Task
B. TOLERANCE EVALUATION SHEETS, FINISH EVALUATIONS,
   AND EXPLANATION OF DIMENSION NUMBERS
   Tolerance Evaluation Sheets
      Horizontal Mill Task
      Vertical Mill Task--Milling a Pocket
      Drill Press Task
      Lathe Task II--Boring, Facing, and Chamfering
      Surface Grinder Task
   Finish Evaluations
      Horizontal and Vertical Mill Tasks
      Drill Press Task
      Lathe Task
      Surface Grinder Task
   Explanation of Dimension Numbers
C. EXPLANATION OF SCORING SYSTEM AND 2 POINT SYSTEM
   Explanation of Scoring System
   2 Point System
INTRODUCTION
McClelland (1973), in a critique of the testing movement in the
United States, suggests that there is an overreliance on intelligence
or aptitude testing, and that the evidence of validity of these kinds
of tests does not support the widespread use that these tests have been
receiving. This position reflects the movement in the United States
which advocates less emphasis on tests of ability and more emphasis on
tests of achievement. McClelland suggests a number of ways to improve
this situation of overreliance on intelligence or aptitude testing.
His first suggestion is that the best kind of testing is criterion
sampling. He states that there is ample evidence to show that tests
which sample job skills will predict proficiency on the job and sug-
gests that there should be less reliance on paper and pencil tests
which tap a general intelligence factor or other "unrelated" abilities.
Wernimont and Campbell (1968) propose an alternative to the
"classic validity model," which results in low validities and
misapplications. Their alternative is the "behavioral consistency"
approach (also advocated by Campion, 1972), which is based on the idea
that the best indicator of future performance is past performance. The
classic validity model uses tests as "signs" to predict, and it is
suggested that the use of "samples" of behavior would work better.
Part of this behavioral consistency approach involves the use of work
samples or job samples. According to Wernimont and Campbell, the
consistency approach would reduce or eliminate problems associated
with faking and response sets, discrimination in testing, and invasion
of privacy.
Job sample testing falls into the general category of
performance testing. According to Adkins et al. (1947), a performance
test is one in which the subject is directed to carry out some
activity. There are two kinds of performance tests. The first is
aptitude testing, which is not job sampling, according to Adkins
et al.¹ An example of this kind of test would be some sort of form
board test. The second kind of performance test is an achievement
test, which is a job sample test. The authors, in 1947, were not very
enthusiastic in recommending the use of performance tests. They
suggest that these tests may be less valid than written tests.
Twenty-five years later, times have changed, and while there may
not be widespread use of performance testing, the idea of this kind of
testing seems to be one which has gained much popularity among
psychologists, especially those in the personnel area. "Today it is
generally conceded that written tests of trade knowledge are not a very
dependable way to evaluate shop performance and that without some type
of direct or indirect measure of actual performance it is unlikely that
we can make an accurate assessment of an individual's trade competence"
(Boyd and Shimberg, 1971).
¹The exclusion of work samples from aptitude testing may not be
totally accurate. See, for example, the report on Job Trials of the
Jewish Employment and Vocational Service (1973).
O'Leary (1973) points out that predictive or concurrent
validity is not ideally the only requirement for fair selection. He
suggests that content validity should be emphasized. This is especially
important in light of Title Seven of the 1964 Civil Rights Act.
O'Leary suggests that the more nearly the test duplicates the specific
tasks to be performed on the job, the greater the chances are of
developing selection devices that are fair. He suggests that job sample
testing be used to meet this standard. In addition to the 1964 Civil
Rights Act, the emphasis on content validity becomes important to
employers who are bound by the guidelines of EEOC (1970), OFCC (1971),
and recent court decisions, especially the Supreme Court case of
Griggs vs. Duke Power (1970). The reason of fair employment
practices seems to be the most compelling one for a growing importance
of job sample testing.
O'Leary's article has not gone without criticism. Gael (1974)
and Blood (1974) criticize O'Leary for not backing up his statements
with empirical evidence. Cole (1974) points out that the issue of
content of the predictors used is a matter of social values and public
policy, and not a question of psychometrics as presented by O'Leary.
These critics, generally speaking, were not so critical of job sample
testing, but were critical of the way O'Leary presented the idea.
A system of performance testing has been successfully used by
the New Jersey Civil Service Commission in the selection and training
of people in various trades, such as carpenters, bricklayers, auto
mechanics, and truck drivers (Scheuer, 1970). Some advantages of
performance testing are claimed in the article:
1) they are the simplest and most economical type of test to
prepare and administer for trades and related positions;
2) they yield results that are more reliable than those of
written or oral tests;
3) a large variety of items are available just by selecting
them from projects in the various trades;
4) there is a much smaller likelihood of failing the candidate
who can do the job, but who lacks the verbal ability to
explain what is to be done;
5) oral or written tests "turn off" the people that need to be
hired--their reaction to performance tests is much more
positive.
The last point is also supported by Steel et al. (1945), who present
evidence showing that people like work sample tests better than other
kinds of testing, although they do not necessarily find them to be
easier than a dexterity test, which is what the work sample test was
compared to.
Scheuer's article is written in a non-scientific manner with
little empirical evidence presented to back up his arguments. Scheuer
states that performance tests are more reliable than written tests, yet
he has no empirical evidence to back this up. He states that
performance tests are the most simple and economical type of test, yet
it has been complexity and prohibitive costs that have been major
reasons for such tests not being used more in the past. It is
interesting and important to be aware that a system of performance
testing has been successfully employed in New Jersey; however, a more
scientific accounting of its success could clarify some of the points
that Scheuer raises.
O'Leary (1973) states that job sample testing "helps the
company to learn something important about the applicant's suitability
for the job, and it enables the applicant to learn something important
about the job's suitability for him." This type of mutually beneficial
situation also characterizes use of job samples in counseling
disadvantaged people. The Jewish Employment and Vocational Service of
Philadelphia (1968) has studied the use of a work sample program in
counseling and training disadvantaged applicants and has found the
program to help counselors and counselees in a number of ways. Spergel
and Leshner (1968), in writing about this program, state that the major
virtue of the work sample approach is that it is reality oriented.
Gordon (1969) reviews how different agencies have used work sample
testing, mainly in counseling disadvantaged youth, and describes
several reasons why he thinks the work sample technique is an extremely
valuable one:
1) It is non-verbal.
2) It has high content validity.²
3) Obvious relevance of the test appears more sensible and
therefore acceptable to test-suspicious youth who are thus
likely to be well motivated to perform on it--more
motivated than in other testing situations.
4) It is likely to be a better predictor of job performance.
5) It makes such apparent good sense that it may be more
attractive to employers.
6) It allows for self-assessment of vocational skills and
an opportunity to discover self-interests; scores are
easier to understand than paper and pencil test scores.
7) Work sample testing provides a firm reality on which to
base self-assessments and provides a concrete base for
disadvantaged people's image of occupations and work
careers.
8) It is possibly less of a device for predicting success
and more of a device for producing success.
²This will only be true if the test is constructed very carefully.
A number of authors have addressed themselves to the issue of
reliability in performance testing. As early as 1945, McPherson stated
that more research into reliability of performance tests is needed.
Part of the lack of enthusiasm of Adkins et al. (1947) in recommending
the use of performance tests was due to two examples that they cite
which show that work samples were not scored reliably; that is,
interrater reliability was low. Shimberg et al. (1972) state that the
most serious shortcoming of performance tests used in occupational
licensing examinations is
the lack of adequate criteria or standards for evaluating
performance. Raters need clear and specific directions as
to what they are to look for, what constitutes acceptable
performance on a given task, and how much credit should be
deducted for failure to satisfy the criteria in specified
ways. Without guidelines, each rater is forced to use
subjective measures that are based on his own experience
and standards.
The previous paragraph is addressed to the issue of scoring
reliability--how well judges, raters, evaluators, observers agree in
scoring the parts of a performance test. Another kind of reliability
in performance testing that is important to look at is test-retest
reliability. Wernimont and Campbell (1968) cite a review of the
literature on criterion theory by Ronan and Prien (1966), who come up
with the conclusion that, with the present available data, the ques-
tion of whether job performance is reliable cannot be answered. They
found very few studies that actually used the same criterion measure
to assess performance at two or more points in time.
In the absence of much knowledge concerning the stability
of relevant job behaviors it seems a bit dangerous to
apply the classic validation model and attempt to
generalize from a one-time criterion measure to an
appreciable time span of job behavior. Utilizing the
consistency notion confronts the problem directly and forces a
consideration of what job behaviors are recurring
contributors to effective performance (and therefore
predictable) and which are not (Wernimont and Campbell,
1968).
Most studies that deal with the issue of reliability in
performance testing deal with the interjudge kind of reliability. There
is a scarcity of literature on test-retest or internal-consistency
reliability, although some studies report intercorrelations of parts
(subtests) of a performance test. The purpose of this paper is to
examine issues in the reliability of a performance test. Before this
study is described in detail, a review of the literature which reports
reliabilities of performance tests will be presented.
Campion (1972) advocates the behavioral consistency approach of
Wernimont and Campbell. However, he writes that the lack of guidelines
for sampling behaviors seems to be a major obstacle to wider use of the
consistency approach. Using job experts, Campion came up with his work
sample test for maintenance mechanics. There were four parts of his
work sample test. These parts and their intercorrelations are
presented in Table 1. Subjects were 34 maintenance mechanics. Besides
the work sample data, various paper and pencil aptitude tests were
administered, and foremen's evaluations of subjects were collected.
The work sample test was significantly related to the foremen's
evaluations; the paper and pencil tests were not. The method of
evaluation used in the work sample test was a check list of behaviors
to look for given to an observer. No mention was made in the article
about any evaluation of end products, nor were any reliability data
given, although coefficient alpha, computed from the data in Table 1,
is .40.
TABLE 1.--Intercorrelations for work sample measures (Campion, 1972).
                                                 Intercorrelations
Part                                             B     C     D    Total
A) Installing pulleys and belts                 .25   .01   .16   .63
B) Disassembling and repairing a gear box             .11   .27   .64
C) Installing and repairing a motor                         .07   .42
D) Pressing a bushing into sprocket and
   reaming to fit a shaft                                         .70
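The coefficient alpha of .40 cited for Campion's data can be reproduced from the six part intercorrelations in Table 1. The sketch below is illustrative only: the thesis does not state which formula was used, so the standardized form of alpha (based on the mean intercorrelation) is an assumption, and the function and variable names are hypothetical.

```python
def standardized_alpha(k, intercorrelations):
    """Standardized coefficient alpha for k parts.

    Assumes alpha = k * r_bar / (1 + (k - 1) * r_bar), where r_bar is
    the mean of the k * (k - 1) / 2 distinct part intercorrelations.
    """
    expected = k * (k - 1) // 2
    if len(intercorrelations) != expected:
        raise ValueError(
            f"expected {expected} correlations, got {len(intercorrelations)}"
        )
    r_bar = sum(intercorrelations) / expected
    return k * r_bar / (1 + (k - 1) * r_bar)


# Part intercorrelations from Table 1 (A-B, A-C, A-D, B-C, B-D, C-D).
# The part-total correlations (.63, .64, .42, .70) are not used.
campion_parts = [0.25, 0.01, 0.16, 0.11, 0.27, 0.07]
campion_alpha = standardized_alpha(4, campion_parts)
print(f"{campion_alpha:.2f}")  # prints 0.40
```

Under the same assumption, the ten intercorrelations reported by Siegel and Jensen (1955) for their five subareas give a value of about .59, which agrees with the figure cited for that study.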
One of the most thoroughly done research projects in the area of
performance testing was that of Bornstein et al. (1957). The authors
did research on the Basic Military Performance Test (BMPT), a work
sample type of measure of achievement in basic training in the U.S.
Army. The BMPT consists of 13 subtests, administered at 13 individual
stations, and requires 16 men for administration. These men observe
and record behaviors.
Bornstein's study is particularly significant because he looks
at reliability fairly extensively. Few studies in the literature have
paid close attention to reliability of performance measures.
Bornstein looks at test-retest reliability and scoring (interobserver)
reliability.
Performance items on the BMPT were classified into tangible
items (end products) and intangible items (observations). Two ob-
servers scored 43 tangible items and 57 intangible items on a pass-fail
basis. Phi-coefficients were computed for each item as a means of
evaluating interscorer agreement. The mean phi-coefficient for the
intangible items was .611 (S.D. = .227), and for the tangible items it
was .776 (S.D. = .187). Interscorer agreement is reported in an
alternate way elsewhere (Bornstein et al., 1954), as the percentage
agreement between scorers.³ For tangible items it was
93% (S.D. = .07), and for intangible items it was 87% (S.D. = .10).
The authors conclude that intangible items can be used with only a
slight loss of reliability as compared with tangible items, but the
added validity (from increasing test length with items that are highly
content valid) should more than compensate for the reduction in scoring
agreement.
³The percentage agreement method of reporting interscorer
agreement cannot be interpreted in the same way as an interscorer
reliability coefficient. The reliability coefficient will be high when
two observers check the same number of items for each observee,
regardless of whether there is any overlap in the particular items
checked. In this case the percentage agreement would be low. However,
a serious shortcoming of the percentage agreement method is that items
with no variance will inflate the percentage agreement reported. The
greater the number of items with no variance, the higher the
percentage agreement.
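The contrast between percentage agreement and a phi-coefficient on a pass-fail item can be illustrated with a small numerical sketch. The scores below are hypothetical and are not taken from Bornstein et al.; the function names are likewise illustrative. Phi is computed with the usual fourfold-table formula.

```python
import math


def percent_agreement(scorer_a, scorer_b):
    """Proportion of examinees given the same pass-fail score by both scorers."""
    matches = sum(1 for a, b in zip(scorer_a, scorer_b) if a == b)
    return matches / len(scorer_a)


def phi_coefficient(scorer_a, scorer_b):
    """Phi for two pass-fail (1/0) score lists; NaN if either has no variance."""
    pairs = list(zip(scorer_a, scorer_b))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical item that almost every examinee passes (20 examinees).
# Each scorer fails two examinees, but never the same two.
scorer_a = [1] * 16 + [1, 1, 0, 0]
scorer_b = [1] * 16 + [0, 0, 1, 1]
print(percent_agreement(scorer_a, scorer_b))          # 0.8
print(round(phi_coefficient(scorer_a, scorer_b), 3))  # -0.111

# Hypothetical item with no variance: everyone passes for both scorers.
all_pass = [1] * 20
print(percent_agreement(all_pass, all_pass))  # 1.0
print(phi_coefficient(all_pass, all_pass))    # nan
```

In this sketch the nearly constant item yields 80% agreement although the scorers never agree on who failed, and the constant item yields 100% agreement while carrying no information about scorer reliability at all.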
Correlation coefficients between scores assigned by the two
different scorers on the 13 subtests ranged from .45 to .95, with a
mean interscorer reliability of .78. Table 2 reports the number of
items contained in each subtest, the number of examinees tested, the
difference between each observer's mean score for each subtest, dif-
ference between standard deviations, and the interobserver correlation.
The authors suggest that the reliability of the test may be too low
for individual diagnosis (or at least limits the use of the test for
this purpose), but it still can have its value in use for group pre-
diction.
Interscorer reliability for the total test was estimated to be
.78, the mean of the subtest reliabilities. Total test reliability
could not be computed directly because the same two observers were not
consistently used for the 13 subtests. The authors present a rationale
for using .78 as a conservative estimate of interscorer reliability of
the total test.⁴
TABLE 2.--Interscorer reliability of subtests (Bornstein et al., 1957).
Station    No. of    No. of       Diff.       Diff.
or         Items     Examinees    Between     Between      r
Subtest              Tested       Means       S.D.'s
  1          9        166         -.22**       .18*       .87
  2          9        245        -1.16**       .42        .61
  3          7        212          .08         .04        .68
  4          6        178          .05         .19**      .82
  5         10        178          .12**       .23**      .72
  6         10        164          .24**       .10        .87
  7          5        171          .04        -.02        .95
  8          6        177          .09         .08        .94
  9          5        224          .01         .13*       .48
 10          6        294         -.01        -.06        .90
 11         10        227          .09         .04        .50
 12         10        278          .10         .08        .74
 13          8        170          .43**      -.36**      .87
*p < .05
**p < .01
⁴Interscorer reliability of the total test had to be estimated
because it was not possible to have a complete set of observers at each
of the 13 subtests. The rationale for using .78 as an estimate of
total test reliability was as follows: Fourteen composites of from
3 to 5 subtests with complete sets of observers were examined. The
reliabilities of these composites, computed in two different ways, were
looked at. One way was computing interscorer reliability of the
composite directly. The other way was calculating the interscorer
reliability separately for each subtest and averaging these
reliabilities. It was found that the composite reliability was somewhat
higher than the average reliability in most comparisons, and so the
authors took the average of the subtest reliabilities as a conservative
estimate of the reliability of the total test. This "finding" was
predictable from classic reliability theory. When different subscales
are assumed to measure different things, and they are uncorrelated, the
best estimate of reliability is the mean of the individual
reliabilities. If the subscales are positively correlated, as was the
case in the Bornstein et al. study, then the above estimate will be a
conservative estimate of the reliability of the total test.
Table 3 presents data on test-retest reliability. Entered in
the table are the subtest, number of examinees taking the subtest, means
and standard deviations for the test and the retest, the differences
between the means and standard deviations, and test-retest reliability.
The mean test-retest reliability of the 13 subtests is .39. The
reliability coefficient for the total test is .67.⁵ As can be seen in
the table, the mean scores went up on the retest. The authors suggest
that learning may have biased the reliability coefficients--without
learning, reliability would have been higher.
Table 4 reports the intercorrelations of the 13 subtests with
each other and with total score. As was the case with Campion's
results, these intercorrelations are low and positive. Coefficient
alpha, computed from the data in Table 4, is .62.
Other results that Bornstein found were that superior and peer
ratings had little or no relationship to the performance test (this is
in disagreement with Campion's findings) and that the BMPT had a lower
correlation with a reading and vocabulary test than did the written
achievement test that was used (.29 vs. .54). This is an indication of
how a verbal factor, which may be unrelated to performance, is present
in written tests and not in performance tests. This verbal factor will
tend to have a negative impact on the test results of disadvantaged
groups.
⁵The fact that test-retest reliability of the total test is
greater than the mean test-retest reliability of the subtests is to be
expected here. Total test is greater in length than each subtest and
therefore total test reliability will be greater than the average
reliability of the subtests provided that subtests are, on average,
correlated positively.
TABLE 3.--Test-retest reliability (Bornstein et al., 1957).
Station   No. of         Test           Retest       Diff.     Diff.
or        Examinees                                  Between   Between    r
Subtest   Tested       X̄     S.D.     X̄     S.D.    X̄'s       S.D.'s
  1        166        2.56   1.11    2.57   1.04     .01       .07       .19
  2        165        7.10   1.90    7.62   1.65     .52**     .25       .10
  3        166        4.60   1.16    4.71   1.29     .11       .13       .32
  4        163        2.29   1.40    3.27   1.50     .98**     .10       .30
  5        157        3.01    .98    3.48    .77     .47**     .21**     .38
  6        167        7.19   1.66    6.96   1.95     .23       .29       .45
  7        165        3.30   1.20    4.23   1.13     .93**     .07       .22
  8        168        3.70   1.81    4.56   1.70     .86**     .11       .53
  9        169        2.78   1.04    2.96    .85     .18*      .19**     .39
 10        164        3.30   1.25    4.11    .63     .81**     .62**     .07
 11        169        3.39   1.72    3.53   1.60     .14       .12       .41
 12        166        1.89   1.27    3.40   1.31     .51       .04       .07
 13        168        4.04   2.13    4.77   2.18     .73**     .05       .51
Total      142       49.74   8.52   56.35   6.92    6.61**   -1.60*      .67
*p < .05
**p < .01
TABLE 4.--Subtest intercorrelations (Bornstein et al., 1957) (N = 307).
[Intercorrelation matrix of subtests 1 through 13 and total score. The
row and column positions of the entries are not recoverable. Surviving
entries, in extraction order: .16 .14 .02 .28 .16 .16 .16 .17 .53 .07
.17 .13 .10 .40 .16 .13 .25 .12 .10 .21 .32 .03 .13 .53 .08 .21 .11
.19 .13 .07 .06 .41 .10 .12 .13 .56 .42 .45]
Siegel (1954b) studied intraobserver consistency. He describes
the ideal method for determining the consistency of an individual
examiner as a situation where an examinee's performance is held
constant over two separate occasions, and the observer's perceptions
are allowed to vary. The best way to do this is with a motion picture.
In his study films were made of Naval Aviation Structural Mechanics
taking a Drill Point Grinding Work Sample Performance Test. Films were
shown twice to five observers, with a one month interval between
showings. Observers were given an evaluation form to fill out while
viewing the film. This form contained questions regarding safety and
process used by the examinees in the film. Intraexaminer consistency
was determined by dividing the number of items answered exactly in the
same manner on each showing by the total number of items on the
evaluation form. The figures for percent consistency were 64.3%, 71.4%,
85.6%, 92.3%, and 100%, with a mean of 82.8%. Siegel suggests that the
range in intraobserver reliabilities warrants a careful investigation
into the area of intraobserver reliability. It is necessary to have
high intraobserver reliability in order for interobserver reliability
to be high.
McPherson (1945) constructed a work sample test for the wood
shop. She looked at intraobserver consistency of end product
measurements, in contrast to Siegel's process evaluations. End products
were measured at two different times by one psychometrician. She found
intraobserver reliability to be in the area of .97.
Siegel and Jensen (1955) developed a job sample trouble-shooting
performance test for aviation electricians.⁶ The test contained five
subareas, in which the electricians (N = 137) had to
identify certain problems in the functioning electrical mechanisms.
The authors found the split-half reliabilities of the subareas,
corrected by Spearman-Brown, to be .90, .59, .72, .84, and .64, and for
the composite it was .86. Intercorrelations between areas of the
performance test are presented in Table 5. Coefficient alpha,
computed from these intercorrelations, is .59. The authors report these
intercorrelations to be relatively low, with the exception of I and
III; however, compared to the results of Bornstein et al. (1957) and
Campion (1972) they were not so low. The validity of the test,
according to the authors, proved to be substantial--the more
experienced electricians scored higher.
⁶The term "job sample" may be inappropriate for the test
described in this paragraph. Although it is unclear from the article,
it seems that the test was a written description of a situation that
may arise on the job, rather than the situation itself.
TABLE 5.--Intercorrelations among trouble-shooting subareas (Siegel
and Jensen, 1955).
Subtest     I     IIa    IIb    III
IIa        .16
IIb        .32    .20
III        .50    .23    .34
IV         .16    .06    .11    .17
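The Spearman-Brown correction applied to the split-half reliabilities above steps a correlation between two half-tests up to an estimate for the full-length test. The sketch below shows the general formula and its inverse for the two-half case. The uncorrected half-test correlations it recovers are implied by the corrected values and are not figures reported by Siegel and Jensen; the function names are illustrative.

```python
def spearman_brown(r, length_factor=2.0):
    """Estimated reliability of a test lengthened by length_factor,
    given the reliability (or half-test correlation) r of the original."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def implied_half_correlation(corrected):
    """Half-test correlation implied by a Spearman-Brown corrected
    split-half reliability (inverse of the formula for two halves)."""
    return corrected / (2 - corrected)


# A half-test correlation of .50 steps up to two-thirds for the full test.
print(round(spearman_brown(0.50), 3))  # 0.667

# Corrected split-half values cited for the five subareas and the composite.
for corrected in (0.90, 0.59, 0.72, 0.84, 0.64, 0.86):
    half = implied_half_correlation(corrected)
    print(f"{corrected:.2f} -> {half:.2f}")
```

The inverse makes visible how much of each reported coefficient comes from the correction itself, which is relevant when corrected split-half figures are compared with uncorrected intercorrelations such as those in Table 5.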
Siegel (1955) compared the scoring of tangible and intangible
items of the aviation structural mechanics tests, an individually ad-
ministered performance test. A check-list method was used to score the
intangible items. Interobserver consistency within a test was calcu-
lated using the percent consistency method of Siegel (1954b). Siegel
found that tangible and intangible procedures yield about equal con-
sistency between observers. This differs somewhat from Bornstein's
findings that intangible items were scored slightly less reliably than
tangible items. Siegel attributes his findings to objectivity in the
check-list procedure and grossness of the observations called for.
In another study using check-lists to evaluate performance in
job sample tests, Siegel (1954a) concludes that check-lists (used to
evaluate process and unsafe behaviors) of performance are to be pre-
ferred over "clinical" (subjective) appraisal of end products. The
check-list is prepared by analyzing a task into component actions which
a man performs in order to complete a task. Siegel reports inter-
examiner reliability in the .90's. His check-list procedure is a more
objective one than clinical appraisal of end products which involves
making subjective evaluations about the quality of the finished
product. The value of this comparison is somewhat shaky. By using
subjective clinical appraisals of end products Siegel is setting up
a straw dog and knocking it over. A more valid and useful comparison
would have been to use some objective appraisal of end products and
compare this with the objective evaluation of process and unsafe be-
haviors that were used in the study.
Fredriksen (1962) constructed an "in-basket" test for managers.
This kind of test can be used to select or promote people in manage-
ment positions. It is performance in nature because it simulates kinds
of activities that the testee would need to carry out on the job. The
test consisted of 68 categories. Testees were scored by 2 observers--
one observer scored the odd numbered items within each category, and
the other scored the even numbered items within each category. Fredriksen
reports the split-half reliabilities of each category. These
reliabilities, which reflect both internal consistency and interob-
server agreement, ranged from .87 to .00, with a median reliability of
about .40.
A few other studies mention that reliability was looked at, but
present no data. Besnard and Briggs (1967) report a study on develop-
ment of a performance test to evaluate maintenance personnel for the
Air Force's E-4 Fire Control System. They report interobserver agreement
to be high. Robins et al. (1958) report a test to evaluate a noncommissioned
officer's ability to generate the support of subordinates
in getting a job done. The authors report "adequate" reliability (internal
consistency) of the total test, although one of the three subscores was
reported to have less than adequate reliability. Havron (1954) reports
close agreement between observers in a test to assess effectiveness of
infantry rifle squads in the army.
In constructing a performance test, or in choosing which test
to use for some research or industrial purpose, one may be faced with
the issue of whether to use a test in which process is most important
or one where the product is what really counts. Boyd and Shimberg
(1971) write about the importance of process (intangible) vs. product
(tangible) evaluation. They report that in the original planning of
a structural mechanics test, equal weights were to be given to process
and product. However, this was changed in the end because the chief
machinist mates objected, saying that the end product is what is most
important. Schmidt (1974) gives a number of advantages of evaluating
end products over process evaluations:
1) The number of test administrators can be reduced; the nature
of the test may be group administered rather than individ-
ually administered.
2) It may be less difficult to train nonpsychologists (or
people who have had no previous experience with the sub-
ject matter of the test) to evaluate end products than to
observe and record behaviors.
3) Interevaluator agreement may be higher when end products
are what are being evaluated.
4) Examinees may feel less threatened or nervous when they are
not constantly being watched.
5) Evaluation of end products can take place after the test,
at the convenience of the evaluator.
6) The resulting scores should be more valid, since it is the
ability to produce high quality finished products which is
important in real life; the method of how they were pro-
duced is merely a means to this end and is therefore of
secondary concern.
This study deals with reliability of a job sample test con-
structed for one of the skilled trades--that of machinist. End
products, rather than process, are what were scored. The type of
product that a machinist produces can be evaluated on two dimensions.
One is the accuracy of the actual physical dimensions. This calls on
the evaluator to make physical measurements, usually using precision
measuring instruments. The other dimension is quality of finish, or
how smooth or rough the end product turned out to be.
Pertaining to the finish dimension, Tiffin and Rogers (1941)
report a study in which 150 judges coded (evaluated for finish) 150
sheets of tin. The sheets contained acceptable sheets and sheets with
four different kinds of defects. The sheets were presented in random
order to the inspectors, who identified the sheet as acceptable, or
called out the kind of defect it contained. Reliabilities were computed
on each category and were reported to be between .68 and .90.
These reliabilities represent the extent to which repeated or duplicate
measurements of each inspector by means of this "coded stack test"
would result in the same score for the defects in question for each
inspector. Large variances in time taken to inspect the sheets were
also reported. While .90 is fairly reliable, .68 is not so desirable.
Furthermore, the variation in time needed to inspect leaves room for
improvement in the inspection process.
Tiffin and McCormick (1965) point out the well-known fact that
judgments that are relative, rather than absolute, are more accurate.
They suggest the use of "limit samples" to evaluate finish. Limit
samples are samples of work pieces that are just barely acceptable
enough to fall into a certain category. These limit samples have the
maximum amount or degree of defects that would be allowed for a par-
ticular category. The inspector is to compare the pieces to be in-
spected with the limit samples and make his judgments accordingly.
Such a comparison usually results in a judgment more adequate than
when an inspector relies on a "memory image" of the degree of a defect
that is acceptable vs. not acceptable. Kelly (1955) found in a study
of inspection of glass panels that untrained subjects were able to make
consistent distinctions between the pieces of glass, when a procedure
using relative judgments was employed.
Stuit (1947) reports on interjudge agreement in grading end
products. Four judges evaluated 30 "samplers" prepared by students
in a basic machinists course, using the usual method of combination
squares. Reliabilities ranged from -.11 to .55. A set of taper gauges
and caliper gauges was devised, and two judges evaluated two more sets
of samplers, yielding reliabilities of .93 and .96. This exemplifies
the importance of using the correct, and most accurate, measuring
devices if scoring reliability (and retest and internal consistency
reliability as well, since these depend on scoring reliability) is to
be high.
Lawshe and Tiffin (1945) and later Evans (1951) report on the
accuracy of precision measurements in industrial inspection. Their
studies show that accuracy of inspectors is far less than assumed by
most authorities in the field. They also found that measurements made
by apprentices are as accurate as those made by journeymen, and ac-
curacy is unrelated to age, seniority, or experience. Evans reports
that inexperienced people can be trained to use micrometers as accur-
ately as experienced industrial workers. The New Jersey Civil Service
Commission has successfully employed non-tradesmen in administering
their tests and evaluating end products (Scheuer, 1970).
PROCEDURE
The present study was carried out as a part of a project funded
by the U.S. Department of Labor (Schmidt et al., 1974). The project
had two main objectives. The first and most general objective was a
pilot empirical evaluation of a set of innovative procedures for the
construction of valid, reliable, and practical job sample tests in the
skilled trades and technical occupations. The second objective was the
assessment of the relative impact of performance tests and traditional
paper-and-pencil achievement tests on the employment opportunities of
minority and disadvantaged persons (Schmidt, 1972). The present study
focused on reliability of the test which Schmidt et al. developed.
The test was constructed for machinists, and consisted of five
tasks to be carried out on five different machines: vertical mill,
horizontal mill, drill press, surface grinder, and engine lathe. The
test was administered in a machine shop by members of the project's
research staff. Subjects (testees) were primarily apprentices in the
tool and die making, and related trades, who had had at least one year
of experience. Some journeymen also participated. It was possible to
obtain only a small number of machinists, but the tool and die makers
were adequate for the research purposes, since they all had sufficient
experience on the machines tested. Subjects came from various plants
of a large automobile company in Detroit, and from three factories in
Chicago that employ people in the trades for which the test was
constructed.
After a brief introduction about the nature and purpose of the
research project and the performance test, subjects (usually five at a
time) moved out to the testing area (the machine shop). At the admin-
istration table were all the testing materials. Only one administrator
was needed because process behaviors were not being recorded. The
testing materials consisted of pieces of metal stock and blueprints.
At each machine were all tools necessary to carry out the various tasks
and tape recorded instructions explaining what the testee was to do.
The blueprints diagrammed the task to be machined and specified what
the tolerances of the dimensions of the finished product were to be.
Transcripts of the taped instructions, and the blueprints can be found
in Appendix A. Each testee was assigned to a station (machine), his
starting and finishing time was recorded, and upon completion of the
task his finished workpiece was turned over to the test administrator.
He was then assigned to a new station, and this process was continued
until all five tasks were completed.
The end products were labeled and taken back to the project
office of Michigan State University to be evaluated. The evaluation
forms can be found in Appendix B. The finished product was evaluated
on two dimensions--finish and tolerances. The evaluation forms instruct
the evaluator on what he or she is to do--where to measure
and what measuring instrument to use in the case of tolerance evalua-
tion, and what to look for or feel for in the case of finish evalua-
tion. In order to decide on the tolerances, the project staff en-
listed the services of a machinist journeyman. The same person also
helped set up the "benchmark" finish evaluation system. This system
was similar to the limit sample described earlier. Benchmarks were
selected from the end products themselves. These benchmarks repre-
sented different categories corresponding to qualities of finish. The
evaluator was to compare the piece to be evaluated with the benchmarks
and decide which category the piece fell in by making a judgment as to
which benchmark the workpiece was closest to.
Project staff members were instructed on evaluation procedures
by the project's machinist consultant. He showed the staff members how
to use the various measuring devices. These devices were a dial
micrometer (theoretical accuracy .0001 inches), dial caliper (theoretical
accuracy .001 inches), a scale, sliding parallels, and a telescoping
gauge. Evaluators were project staff members, and student employees
who were trained, by project staff members, in how to use the various
instruments and evaluate the end products. In light of findings by
Evans (1951) and the experience of the New Jersey Civil Service
Commission (Scheuer, 1970), it was felt that non-tradesmen could
reliably evaluate the end products. This, however, is a matter of
empirical research.
The purpose of this research was not to prove specific research
hypotheses; it was to investigate the thesis that the kind of test de-
scribed above can be constructed to yield high (or at least adequate)
reliabilities (scoring, retest, and internal consistency). Scoring
reliability, or interjudge reliability, assesses how consistent various
evaluations are across judges. In order for a job sample test to be
used by industry, by a government agency, or in a counseling setting,
it must be scored reliably, or individual scores will be meaningless.
Table 6 outlines the interjudge reliabilities that were computed from
the data in this study. Reliabilities for tolerance evaluations, unless
otherwise stated, are computed using a two point scoring system,7
as outlined in Appendix C. The subject received 2 points if his mea-
surement fell within the first tolerance (corresponding to the first
tolerance specified on the evaluation forms in Appendix B), 1 point
within the second tolerance, and 0 points if he did not meet either the
first or second tolerance. The finish evaluations were on a 3, 4, or 5
7Two points was the highest score any subject could receive on
a particular dimension, except in the case of the surface grinder,
which was scored on a three point system.
point system, corresponding to the benchmarks on the evaluation forms
found in Appendix B.
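The two point tolerance scoring rule described above can be sketched as follows. This is only an illustration: the nominal value and the two tolerance bands are hypothetical (the actual tolerances are those on the evaluation forms in Appendix B), and it is assumed here that the bands are symmetric about the nominal dimension and that a measurement exactly on a limit counts as within it.

```python
def tolerance_score(measured, nominal, first_tol, second_tol):
    """Return 2, 1, or 0 points for one measured dimension.

    Assumptions (not from the study): tolerances are symmetric about the
    nominal value, and a reading exactly on a limit is inside that limit.
    """
    deviation = abs(measured - nominal)
    if deviation <= first_tol:
        return 2
    if deviation <= second_tol:
        return 1
    return 0


# Hypothetical dimension: nominal 1.0000 in., first tolerance +/- .0005 in.,
# second tolerance +/- .0010 in. Values are in ten-thousandths of an inch
# so that the comparisons use whole numbers.
scores = [tolerance_score(m, 10000, 5, 10) for m in (10003, 9992, 10025)]
print(scores)  # [2, 1, 0]
```

Under these assumptions the three readings fall inside the first tolerance, inside the second tolerance only, and outside both, respectively.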
TABLE 6.--Inter-judge reliability.
A. Tolerance Scores--r's for first 68 Ss (3 evaluators per subject).a
1) on raw measurements for all dimensions
2) on scored measurements for all dimensions
3) on raw measurements for each of the five machines
4) on scored measurements for each of the five machines
5) average interjudge reliability for each dimension corrected
by Spearman-Brown for 3 judges
B. Finish Scores
1) for each finish evaluation
2) for total test
aThere are two separate reliability analyses--one for odd numbered
subjects and one for evens.
All finish evaluations were done by four evaluators. Two
people evaluated the odd-numbered subjects and two people evaluated the
even-numbered subjects. Six people evaluated the first 68 subjects'
tolerance dimensions. Three evaluated the odd numbers, and three
evaluated the even numbers. The first 68 subjects were the only ones
whose workpieces were evaluated by three people on tolerance, and
therefore only data from the first 68 subjects were used in computing
inter-judge tolerance reliabilities. The remainder of the subjects
(N = 42) were evaluated by two judges.
There are a number of questions related to inter-judge reliability
that this study addressed. They are as follows:
1) Can non-tradesmen be used to score tests reliably?
2) Can tolerance scores be evaluated more reliably than finish
scores?
3) Do certain evaluators have constant biases (too high or too
low) in the way they measure the various tolerance dimensions?
4) Are some evaluators more "sloppy" than others in the way they
evaluate tolerances?
5) Do measurements become more reliable over time?
A "yes" was hypothesized to be the answer to all five questions.
The following analyses were carried out to test the questions related
to inter-judge reliability:
1) Inter-judge reliabilities were examined: high reliabilities
(around .90 and above for tolerance, somewhat less for finish)
are evidence that non-tradesmen can score the test reliably.
2) Mean inter-judge reliabilities on each machine were computed
for tolerance and finish, and were compared. Mean inter-judge
reliability on the total test was also computed and compared.
Fisher's r to z transformations were computed on the reliability
coefficients and were compared.
3) Eight dimensions were selected to test this hypothesis--four
of these involved measurements with the calipers and four with
the micrometer. These dimensions were selected because they
involved the most straightforward use of micrometer and calipers.
In order to be considered a "biased" measurer, an
evaluator had to be consistently high or low in his scored
measurements with a particular instrument, and this bias had
to reach a significant level. Matched-pair t-tests were used
to test the magnitudes of measurements that were consistently
biased.
4) Three different analyses were carried out to look at sloppiness
of raters:
a) Variances of the evaluators were compared using F-tests.
b) Coefficient alphas were computed on each judge and were
compared using Fisher's z-transformations.
c) Interjudge reliabilities were examined in order to determine
whether any one judge was more sloppy than the
other two. If a sloppy judge was found, reliabilities
were to be compared using Fisher's z-transformations to
test for magnitude of the differences.8
5) Deviations (absolute values of measurement of measurer 1 minus
measurer 2) were correlated with time and tested for statistical
significance.
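The matched-pair t-test named in analysis 3 can be sketched as below. The readings are hypothetical and are not data from the study; the helper name is illustrative. The statistic is the mean of the paired differences divided by its standard error, with n - 1 degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Matched-pair t statistic and degrees of freedom for two judges' readings."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical micrometer readings (ten-thousandths of an inch) of the same
# five workpieces by two evaluators; judge_a reads lower on every piece.
judge_a = [6231, 6240, 6228, 6235, 6244]
judge_b = [6233, 6243, 6229, 6239, 6246]
t, df = paired_t(judge_a, judge_b)
print(round(t, 2), df)  # -4.71 4
```

With 4 degrees of freedom the two-tailed .05 critical value of t is 2.776, so a consistent difference of this size would count as a significant bias in this made-up example.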
Internal consistency reliability assesses the extent to which
the test measures one general factor. It is directly dependent on how
the different parts of the test relate to each other. Besides the
various coefficient alphas, listed in Table 7, a correlation matrix
8The rationale behind these analyses was as follows:
a) Variance would be at a minimum when there is no sloppiness.
The measurements of a sloppy evaluator will be more varied
(his measurement will differ from the true dimension) than
those of a less sloppy evaluator.
b) No method of significance testing specifically addressed to
the issue of comparison of two coefficient alphas could be
found. It was therefore decided to treat coefficient alpha
as an r², and Fisher's r to z transformations were computed
on the square root of alpha. These z's were compared.
c) If r12 is higher than r13 and r23, this means that judge 3
is a sloppier judge (less reliable) than judge 1 or judge 2.
There will, of course, always be one judge with lower r's,
so in order for a judge to be considered sloppy, he needed
to have r's that were consistently lower.
TABLE 7.--Internal consistency (coefficient alpha).a
A. Tolerance Scores computed on first 68 subjects using the median
measurement.
B. Finish Scores.
C. Tolerance plus finish scores.
aThis analysis was done for each machine, for total test containing as
many dimensions as there are measurements, and for total test, using
each machine total as one item.
containing various parts of the test, outlined in Table 8, was com-
puted.
Due to practical considerations of the project from which this
study came, an adequate assessment of the test-retest reliability could
not be carried out. Retest subjects were volunteers who had responded
positively to a letter sent to them at the end of the testing period
for the first 68 subjects. Because of this, the time interval between
the first and second testing was too long. The optimal time period
would be from one to two weeks. Any period longer than that would have
a negative effect on test-retest reliability. This is because there
was no standardization in the apprenticeship program, resulting in
different amounts of training and practice on the machines for
TABLE 8.--Contents of correlation matrix.a
A. Performance Test
1) Total tolerance plus finish score
2) Total tolerance score
3) Total finish score
B. Tolerance Plus Finish Scores for:
1) Horizontal Mill
2) Vertical Mill
3) Drill Press
4) Lathe
5) Surface Grinder
C. Tolerance Scores for:
1) Horizontal Mill
2) Vertical Mill
3) Drill Press
4) Lathe
5) Surface Grinder
D. Finish Scores for:
1) Horizontal Mill
2) Vertical Mill
3) Drill Press
4) Lathe
5) Surface Grinder
aThis can be done for each dimension, for each machine, and for total.
different people. The retest reliability coefficients computed in this
study9 should not be taken to represent what the true test-retest
reliability of this performance test is. Instead they represent the
extent to which subjects' scores are stable over a period of time which
allowed the testees to have different amounts of training and practice
on the different machines.
9Test-retest reliability was computed on tolerance and finish
for each dimension, for each machine, and for total.
RESULTS
Inter-Judge Reliability
Tolerance scores are reliable: The data on inter-judge relia-
bility are presented in Tables 9-14. With only a few exceptions,
inter-judge reliability was very high. Table 9 presents the inter-
judge correlations between the three judges who evaluated the odd-
numbered subjects, and between the three judges who evaluated the
even-numbered subjects. These correlation coefficients were computed
on the raw data (that is, on unscored measurements). These correla-
tion coefficients are affected by extreme measurements (measurements
that are far from specified tolerances) in a positive direction, and
therefore are, in most cases, slightly higher than the correlation
coefficients found in Table 10, which presents inter-judge reliabil-
ities computed on the scored measurements. The inter-judge relia-
bilities on the scored measurements are what need to be examined to
assess the extent of interevaluator agreement, because they are not
affected by extreme deviations from tolerance, or by extreme disagree-
ment between evaluators on only one or two measurements, in the way
that the raw score inter-judge reliabilities are. Furthermore, raw
TABLE 9.--Inter-judge reliability of tolerance scores--Raw measurements
(computed on first 68 subjects).
                                   Evaluators
                             Odds                     Evens
Machine       Dimension*  1 & 6  1 & 7  6 & 7      2 & 4  2 & 5  4 & 5
Horizontal        1         .99   1.00   1.00        .99    .99    .98
Mill              2         .99   1.00   1.00        .99    .98    .96
                  3        1.00    .99    .99       1.00   1.00   1.00
                  4        1.00   1.00   1.00       1.00   1.00   1.00
                  5         .96    .99    .94       1.00   1.00   1.00
                  6         .98    .98    .97       1.00   1.00   1.00
Vertical          1        1.00    .86    .86       1.00   1.00   1.00
Mill              2        1.00    .87    .87       1.00   1.00   1.00
                  3         .97    .98    .96        .99    .99   1.00
                  4         .99    .99    .99        .99    .98    .98
                  5         .72    .98    .69        .99    .99    .99
                  6         .93    .92    .99        .97    .49    .39
                  7         .92    .95    .95        .99    .92    .92
Drill             1         .98    .99    .98       1.00   1.00   1.00
Press             2        1.00   1.00   1.00        .98    .99    .98
                  3        1.00   1.00    .99        .97    .99    .98
                  4        1.00   1.00    .99       1.00   1.00   1.00
                  5         .99   1.00    .99       1.00   1.00   1.00
                  6         .99    .98    .98        .96    .93    .91
Lathe             1         .97    .99    .97       1.00   1.00   1.00
                  2         .95    .96   1.00       1.00   1.00   1.00
                  3         .95    .95    .99        .99   1.00    .99
                  4         .76    .55    .49        .87    .95    .93
                  5         .75    .90    .86        .93    .93    .95
Surface           1         .97    .99    .97        .99    .99    .99
Grinder           2         .99    .97    .97        .93    .99    .92
                  3         .99    .99    .99        .73    .93    .86
                  4         .97    .99    .97        .83    .93    .96
*For a description of what tasks the dimension numbers in this column
represent, see Appendix B.
TABLE 10.--Inter-judge reliability of tolerance scores--Scored measurements
(computed on first 68 subjects).
                                   Evaluators
                             Odds                     Evens
Machine       Dimension   1 & 6  1 & 7  6 & 7      2 & 4  2 & 5  4 & 5
Horizontal        1         .92    .94    .90        .90    .96    .90
Mill              2         .96    .98    .94        .85    .87    .73
                  3         .84    .79    .78        .84    .77    .89
                  4         .87    .89    .84        .86    .83    .83
                  5         .81    .76    .71        .87    .90    .92
                  6         .90    .72    .68        .90    .88    .98
Vertical          1         .87    .89    .88        .94    .91    .89
Mill              2         .84    .84    .89        .90    .92    .94
                  3         .82    .85    .72       1.00    .98    .98
                  4         .77    .82    .72        .88    .76    .88
                  5         .89    .89    .88       1.00   1.00   1.00
                  6         .82    .76    .82        .84    .94    .77
                  7         .85    .89    .82        .95    .93    .95
Drill             1         .72    .77    .94        .63    .87    .75
Press             2         .94    .92    .98        .85    .95    .89
                  3         .83    .88    .85        .81    .88    .83
                  4         .80    .89    .86        .98    .98   1.00
                  5         .91    .93    .98        .96    .96    .95
                  6         .98    .96    .98        .90    .92    .88
Lathe             1         .66    .71    .52        .82    .84    .79
                  2         .77    .77    .84        .78    .93    .84
                  3         .67    .77    .77        .74    .93    .81
                  4        -.29   -.13    .22        .33    .63    .69
                  5        -.21   -.38    .24        .58    .40    .50
------------------------------------------------------------------------
Surface           1
Grinder           2
                  3         .86    .91    .90        .94    .88    .88
                  4
scores cannot be used as performance test scores because they do not
reflect "skill," or how close the testee's measurements come to spe-
cified tolerances.
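The difference between raw and scored reliabilities can be illustrated with a small sketch. All numbers here are hypothetical and deliberately exaggerated: one workpiece is far out of tolerance, and the two judges disagree by a few units on the others. The scoring bands (first tolerance 5, second tolerance 10, symmetric about a nominal of 1000) are likewise assumptions made only for this example.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def score(measured, nominal=1000, first_tol=5, second_tol=10):
    """Hypothetical 2/1/0 scoring with symmetric, inclusive tolerance bands."""
    deviation = abs(measured - nominal)
    return 2 if deviation <= first_tol else 1 if deviation <= second_tol else 0


# Hypothetical readings by two judges; the last workpiece is an extreme case.
judge_a = [1002, 996, 1007, 1012, 994, 1150]
judge_b = [1006, 1004, 1004, 1009, 996, 1149]

r_raw = pearson_r(judge_a, judge_b)
r_scored = pearson_r([score(m) for m in judge_a], [score(m) for m in judge_b])
print(round(r_raw, 3), round(r_scored, 3))
```

In this example the raw correlation is above .99 because both judges agree that the last piece is far off, while the correlation between the scores is only about .55, since small disagreements near a tolerance limit change the points awarded.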
Table 11 presents inter-judge correlations by machine for the
three odd and three even evaluators on raw measurements. Table 12 is
the analogue of Table 11 for scored measurements. These tables reveal
that inter-judge reliability is very high (in the .90's) for all the
machines except the lathe. Table 13, which presents the average inter-
judge reliability for each measurement, corrected by Spearman-Brown for
three evaluators, also bears out this finding. As can be seen in
Table 13, all the tasks have reliability coefficients in the .90's
except for the lathe. Furthermore, the evaluators of the odd-numbered
subjects had lower agreement than the evaluators of the even-numbered
subjects. This was tested by the Wilcoxon matched-pairs signed-rank
test (p < .01 for a two tailed test). Although all of the lathe mea-
surements produced agreements that were somewhat less than agreement
on measurements of other machines, measurements number four and five
were far below. These two measurements were the inside and outside
chamfers. The process of measuring these dimensions required that the
evaluator make a very fine measurement using a scale, which is a non-
precision instrument. One other measurement required the use of a
scale--the diameter of the countersink on the drill press task (drill
press measurement number 6). The reliabilities of this measurement
TABLE 11.--Inter-judge reliability of tolerance scores by machine--raw
measurements.
                               Evaluators
                         Odds                     Evens
                   1 & 6  1 & 7  6 & 7      2 & 4  2 & 5  4 & 5
Horizontal Mill      .99    .99    .98       1.00   1.00    .99
Vertical Mill        .97    .96    .94        .99    .95    .94
Drill Press          .99    .99    .99        .99    .99    .99
Lathe                .84    .83    .88        .96    .98    .97
Surface Grinder      .99   1.00    .99        .90    .97    .96
TABLE 12.--Inter-judge reliability of tolerance scores by machine--
scored measurements.
                               Evaluators
                         Odds                     Evens
                   1 & 6  1 & 7  6 & 7      2 & 4  2 & 5  4 & 5
Horizontal Mill      .93    .93    .91        .93    .92    .96
Vertical Mill        .92    .93    .94        .98    .97    .96
Drill Press          .93    .94    .97        .92    .97    .95
Lathe                .45    .45    .59        .78    .85    .85
Surface Grinder      .95    .95    .95        .97    .97    .97
TABLE 13.--Average inter-judge reliability of each tolerance dimension
(corrected by Spearman-Brown formula for three judges).
                               Evaluators
Machine          Dimension    Odds    Evens
Horizontal Mill      1         .97     .97
                     2         .99     .93
                     3         .92     .94
                     4         .95     .94
                     5         .91     .96
                     6         .91     .97
Vertical Mill        1         .96     .97
                     2         .95     .97
                     3         .92    1.00
                     4         .91     .94
                     5         .96    1.00
                     6         .92     .94
                     7         .95     .98
Drill Press          1         .97     .90
                     2         .98     .96
                     3         .95     .94
                     4         .94     .99
                     5         .98     .98
                     6         .99     .96
Lathe                1         .84     .93
                     2         .92     .94
                     3         .89     .93
                     4        -.23     .79
                     5        -.45     .74
--------------------------------------------
Surface Grinder
were higher than those of the chamfers. This can be explained by two
factors. One is that the measurement of the countersink dimension can
be read much more easily than the chamfers. The scale is simply rested
flat on the workpiece. With the chamfers, it is not so easy. The scale has
to be held at an angle and is therefore subject to errors caused by
unsteadiness of the evaluator's hand and by placing the scale on the
workpiece at the wrong angle. The second, and most important, factor
is that the tolerances for the chamfer were more stringent. This has
a great effect on the reliability of the scored measurements. The
inter-judge reliabilities on the raw measurements for the chamfers
(Table 9) were much higher than those for the scored measurements
(Table 10).
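The Spearman-Brown correction for three judges used in Table 13 can be sketched as follows. It is assumed here that the "average" is the simple arithmetic mean of the three pairwise correlations in Table 10 before the correction is applied; the function names are illustrative.

```python
def spearman_brown(mean_r, k):
    """Reliability of the pooled judgment of k judges, given the mean pairwise r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def average_corrected(pairwise_rs, k):
    """Average the pairwise r's (assumed: simple mean), then step up to k judges."""
    mean_r = sum(pairwise_rs) / len(pairwise_rs)
    return spearman_brown(mean_r, k)


# Scored-measurement correlations from Table 10, odd-numbered subjects.
horizontal_mill_1 = [.92, .94, .90]
lathe_4 = [-.29, -.13, .22]          # inside/outside chamfer measurement

print(round(average_corrected(horizontal_mill_1, 3), 2))  # 0.97
print(round(average_corrected(lathe_4, 3), 2))            # -0.23
```

Both results agree with the corresponding Table 13 entries, which suggests that the simple-mean assumption is close to what was done. The same function with k = 2 gives the two-judge correction applied to the finish correlations, for example .62 becomes about .77.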
Finish scores are reliable: Table 14 presents inter-judge
reliability for finish scores. Most correlations were in the .70's
and .80's after being corrected by Spearman-Brown for two judges.
There was only one very low correlation (the third lathe dimension--
finish of the chamfer), and this was low only for the judges of the
odd-numbered subjects. Overall, these inter-judge correlations may
be considered adequate, but not impressive in magnitude. It is the
more subjective elements involved in finish evaluations that can
explain why finish reliability is not extremely high.
Non-tradesmen can evaluate reliably: All evaluators in this
study were non-tradesmen, had no previous experience in using the
TABLE 14.--Inter-judge reliability for finish dimensions.*
Machine           Dimension   Judges 3 & 8 (odd)   Judges 5 & 9 (even)
Horizontal Mill       1           .62 (.77)            .66 (.79)
Vertical Mill         1           .74 (.85)            .53 (.74)
                      2           .71 (.83)            .72 (.84)
Drill Press           1           .84 (.91)            .81 (.90)
                      2           .51 (.68)            .86 (.80)
Lathe                 1           .66 (.79)            .70 (.82)
                      2           .59 (.74)            .69 (.81)
                      3           .11 (.21)            .67 (.80)
Surface Grinder       1           .46 (.63)            .71 (.83)
                      2           .54 (.70)            .48 (.65)
*r's in parentheses are corrected by Spearman-Brown formula for two
judges.
instruments required for tolerance evaluation, and were previously un-
familiar with the machining process. Yet with only a moderate amount
of training on the tolerance measurements (one instructional period
lasting about 20 minutes for each one of the five tasks), and with
initial supervised practice on about five workpieces, evaluators were
able to score the performance test reliably--mostly in the .90's for
tolerance, somewhat lower for finish.
Tolerance evaluations are more reliable than finish evalua-
tions: Tolerance measurements were consistently evaluated more re-
liably than finish. Reliabilities for tolerance were mostly in the
.80's and .90's. For finish they were mostly less than .80. Table 15
presents the data which show that differences between tolerance and
finish reliabilities were highly significant on all machines but the
lathe.
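The comparison of a tolerance reliability with a finish reliability can be sketched with Fisher's r to z transformation. The example uses the horizontal mill values from Table 15 (.93 and .64). Several points are assumptions rather than statements from the text: the average N's in the table footnote are read as 33 and 60, the two coefficients are treated as independent, and the reported p-values are taken to be one-tailed.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """z statistic and one-tailed p for the difference between two r's.

    Assumes the two correlations come from independent samples; the
    variance of the difference of the Fisher z's is 1/(n1-3) + 1/(n2-3).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher's r to z
    variance = 1 / (n1 - 3) + 1 / (n2 - 3)
    z = (z1 - z2) / math.sqrt(variance)
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_tailed


# Horizontal mill: mean tolerance r = .93 (N = 33), mean finish r = .64 (N = 60).
z, p = compare_correlations(.93, 33, .64, 60)
print(round(z, 2))  # 3.99
```

Under these assumptions the one-tailed probability is below .00005, in line with the value reported for the horizontal mill.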
Biases in evaluators' measurements are rare: Tables 16 and 17
present means and standard deviations of raw and scored measurements
for all dimensions. The dimensions which are in parentheses are the
ones which are relevant to testing whether certain evaluators have
measurement biases. The data (in Table 17) were first examined to see
if certain evaluators had consistent biases--too high or too low. Only
three instances of consistent biases were found: evaluator number 1
had the lowest mean measurement on three of the four micrometer dimen-
sions; evaluator number 4 had the highest mean measurement on three of
TABLE 15.--Comparison of inter-judge reliabilities for tolerance and
finish.a
                   Mean Tolerance   Mean Finish    p-value for difference
Task               Reliability      Reliability    between r's
Horizontal Mill        .93             .64          p < .00005
Vertical Mill          .95             .6875        p < .00003
Drill Press            .9467           .755         p < .0002
Lathe                  .6616           .57          p < .25 (N.S.)
Surface Grinder        .96             .5475        p < .00001
Total                  .8896           .64          p < .002
aAverage N for tolerance was 33; average N for finish was 60.
Variance of the statistic was computed by the formula:
$\mathrm{Var} = \frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}$
TABLE 16a.--Means and standard deviations of raw measurements (odds).
                                           Judge
                        1                    6                    7
Task              Mean      S.D.       Mean      S.D.       Mean      S.D.
Horizontal  1    9996.765  147.064    10008.235  145.733     9992.059  147.793
Mill        2    9992.353  150.471     9997.059  145.539     9986.176  151.590
(N = 34)    3    6233.588  241.044     6226.794  239.941     6243.588  244.869
            4    6233.000  242.337     6206.059  241.510     6229.500  242.054
            5    1287.294   65.759     1292.822   72.161     1281.059   66.068
            6    1279.824   66.363     1281.676   66.930     1272.000   69.404
Vertical    1    2435.647  291.507     2435.059  292.058     2465.118  341.365
Mill        2    2436.088  288.757     2434.735  289.139     2462.618  338.861
(N = 34)    3     999.882   11.654     1002.147    9.783      999.824   11.746
            4     631.353   16.331      631.829   16.813      630.765   16.329
            5     631.382   16.073      629.324   25.637**    632.353   16.562
            6     632.735   16.497      633.176   19.196      634.088   19.318
            7    1500.676   10.991     1497.794   11.687     1499.971   10.923
Drill       1    7603.059   84.140     7592.235   88.965     7594.824   86.16
Press       2   17499.412  242.307    17496.000  243.100    17498.412  242.816
(N = 34)    3   17538.206  250.421    17532.765  250.620    17540.029  248.081
            4    9987.706  235.356     9965.059  236.767     9981.912  225.382
            5   10006.824  231.139     9990.353  233.268    10002.794  225.249
            6      97.382    5.151       96.882    5.155       97.294    5.119
Lathe       1   19975.094   74.429    19960.594   70.655    19974.469   74.466
(N = 32)    2    8034.594  117.620     8013.437  124.831     8021.750  122.688
            3    8040.844  114.440     8019.156  114.283     8027.781  122.348
            4      46.656   14.177       65.594   15.801       64.625   19.021
            5      47.750   19.444*      65.625   28.014       66.656   27.645
Surface     1    7193.059   11.735     7192.559   12.446     7192.618   11.649
Grinder     2    7193.000   11.702**   7191.912   12.001     7193.235   11.995
(N = 34)    3    7192.882   11.697**   7192.029   11.681     7192.412   11.259
            4    7193.235   12.433     7191.500   12.188     7192.735   12.013
*p < .05
**p < .01
TABLE 16b.--Means and standard deviations of raw measurements (evens).
                                           Judge
Task              Mean      S.D.       Mean      S.D.       Mean      S.D.
Horizontal  1   10023.500   95.808    10022.250   96.695    10033.281   97.160
Mill        2   10025.844   96.355    10019.125   92.914    10028.281   97.473
(N = 32)    3    6268.406  378.403     6273.125  377.672     6258.375  379.511
            4    6269.594  379.776     6272.750  377.569     6256.500  379.701
            5    1457.687  513.807     1457.219  514.158     1454.187  515.331
            6    1412.562  462.461     1413.812  462.089     1414.375  462.126
Vertical    1    2451.118  233.393     2447.000  232.978     2452.412  233.853
Mill        2    2454.235  236.302     2450.118  233.760     2454.882  233.798
(N = 32)    3    1006.412   27.612     1006.678   28.017     1006.382   28.072
            4     632.794   18.902      631.029   18.379      631.029   19.268
            5     632.176   15.392      630.912   14.958      630.647   15.206
            6     634.382   14.625      633.118   14.614      634.735   19.048
            7    1495.235   19.085     1495.588   19.113     1496.235   19.217
Drill       1    7704.529  392.551     7689.882  393.022     7701.176  392.490
Press       2   17535.029  166.651    17552.618  168.649    17527.529  162.987
(N = 34)    3   17562.735  165.622    17576.294  164.943    17563.706  164.405
            4   10067.441  299.925    10077.912  306.188    10063.176  298.323
            5   10095.882  292.665    10110.971  297.768    10094.324  300.948
            6      97.735    4.147       98.059    3.629       97.735    4.266
Lathe       1   20009.545  228.157    20014.485  215.148    20008.242  219.314
(N = 33)    2    8036.879  168.234     8029.727  169.593     8035.848  168.820
            3    8005.970  148.053     8022.909  142.520     8006.788  142.473
            4      65.424   23.652       68.364   23.324       65.273   20.674
            5      66.000   22.409       66.485   24.952       70.303   23.468
Surface     1    7191.853   13.855     7192.206   12.395     7191.676   11.846
Grinder     2    7191.618   13.844     7192.824   15.544     7191.324   12.084
(N = 34)    3    7188.059   30.086     7192.794   10.295     7189.588   17.119
            4    7186.441   35.290     7192.353   11.430     7190.706   15.038
46
TABLE 17a.--Means and standard deviations of scored measurements (odds).[a]

                                   Judge
Task                  1               6               7
                 Mean    S.D.    Mean    S.D.    Mean    S.D.
            (1)   .853    .845    .853    .845    .882    .867
Horizontal  (2)   .882    .857    .882    .867    .853    .845
Mill         3    .971    .785   1.000    .804    .794    .796
(N = 34)     4   1.000    .840   1.000    .767    .971    .822
             5   1.176    .890   1.118    .867   1.029    .857
             6   1.175    .821   1.205    .867   1.088    .818
             1    .941    .838    .971    .923    .941    .906
             2   1.029    .891    .941    .906    .941    .905
Vertical     3   1.118    .796   1.324    .794   1.118    .758
Mill         4   1.059    .998    .941    .998   1.118    .993
(N = 34)     5   1.118    .993   1.000   1.000   1.000   1.000
             6   1.118    .993   1.175    .984   1.118    .993
            (7)  1.382    .841   1.353    .836   1.265    .851
             1    .882    .993   1.059    .998   1.000   1.000
Drill        2    .794    .867    .755    .842    .735    .815
Press        3    .755    .769    .735    .779    .675    .755
(N = 34)     4    .547    .723    .705    .787    .647    .762
             5    .647    .800    .706    .787    .575    .794
             6   1.529    .848   1.500    .849   1.529    .813
            (1)  1.281    .874   1.000    .865   1.155    .833
Lathe        2   1.031    .918   1.031    .883    .959    .883
(N = 32)     3    .906    .879    .875    .781    .875    .857
             4    .313    .725    .525    .927    .688    .950
             5    .313    .725    .875    .992    .875    .992
Surface     (1)   .457*   .502    .543    .602    .571    .645
Grinder     (2)   .514*** .592    .500    .685    .629    .580
(N = 35)    (3)   .500    .800    .557    .715    .557    .754
            (4)   .543    .731    .557    .754    .543    .690
[a] Dimensions in parentheses were used to test whether certain judges had
measurement biases.
*p < .10
**p < .05
***p < .025
TABLE 17b.--Means and standard deviations of scored measurements (evens).

                                   Judge
Task                  2               4               5
                 Mean    S.D.    Mean    S.D.    Mean    S.D.
            (1)  1.094    .843   1.125    .893   1.094    .843
Horizontal  (2)  1.094    .843   1.125    .857   1.031    .847
Mill         3    .969    .809   1.094    .879    .969    .847
(N = 32)     4    .875    .740    .969    .883    .969    .809
             5   1.063    .899   1.094    .914   1.094    .879
             6    .969    .847    .875    .857    .906    .879
             1   1.176    .856   1.206    .832   1.147    .912
             2   1.147    .845   1.176    .821   1.147    .879
Vertical     3   1.000    .908   1.000    .907    .971    .923
Mill         4   1.059    .998   1.059    .998   1.059    .998
(N = 34)     5   1.059    .998   1.059    .998   1.059    .998
             6   1.118    .993    .941    .998   1.059    .998
            (7)  1.000    .939    .912    .887    .941    .906
             1    .529    .882    .765    .972    .647    .936
Drill        2    .941    .725    .971    .857   1.000    .767
Press        3    .794    .796    .735    .740    .824    .785
(N = 34)     4    .647    .800    .618    .768    .618    .768
             5    .676    .794    .618    .768    .618    .768
             6   1.559    .735   1.676    .629   1.588    .771
            (1)   .879    .844    .909    .900    .758    .818
Lathe        2    .667    .765    .788    .844    .758    .740
(N = 33)     3    .818    .796    .909    .900    .788    .807
             4    .606    .919    .909    .996    .848    .988
             5    .424    .818    .667    .943    .485    .857
Surface     (1)   .735*** .779    .912    .818    .853    .772
Grinder     (2)   .794    .795    .765    .807    .824    .821
(N = 34)    (3)   .853    .879    .882    .832   1.000    .840
            (4)   .824**  .856    .912    .818    .912    .887
*p < .10
**p < .05
***p < .025
the four caliper dimensions; evaluator number 2 had the lowest mean
measurement on three of the four micrometer dimensions. In order for
these biases to have any practical significance, they not only need to
be consistent in direction (high or low) for an evaluator, but also
must be of significant magnitude. Matched-pair t-tests were computed
on the data that had consistent biases. The biased evaluators'
measurements were paired with the measurements of the evaluator whose
mean measurement was in the middle of the other two measurements. None
of evaluator number four's biases were statistically significant; two
of evaluator number one's measurements were significant (surface
grinder 1, p < .10; surface grinder 2, p < .025); two of evaluator
number two's measurements were significant (surface grinder 1,
p < .025; surface grinder 4, p < .05). It appears from the results of
this study that evaluators' biases are not both consistent enough and
high enough in magnitude to warrant the conclusion that there are
biases in certain evaluators' measurements.
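The matched-pair t-test described above can be sketched as follows. This is a minimal illustration only: the readings are hypothetical (not data from this study), the function name is an assumption, and the resulting t would still have to be compared against a t table with the stated degrees of freedom.

```python
import math
import statistics

def paired_t(x, y):
    """Return (t, df) for a matched-pair t-test on two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD of the paired differences
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical readings of the same five workpieces by two evaluators
biased = [7193, 7195, 7192, 7196, 7194]   # evaluator suspected of bias
middle = [7192, 7193, 7192, 7194, 7193]   # evaluator with the middle mean
t, df = paired_t(biased, middle)
print(round(t, 3), df)
```

Pairing by workpiece removes differences between workpieces from the comparison, so only the evaluators' disagreement enters the test.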
Certain evaluators are not more sloppy: Comparison of standard
deviations of the evaluators on each measurement (Tables 16 and 17)
showed little sloppiness of the evaluators. F-tests were done on all
dimensions on variances of raw measurements and of scored measurements.
The tests on scored measurements produced no significant differences
between variances. Tests on raw measurements produced only a few.
Measurement number five on the vertical mill showed evaluator number
six to be more sloppy (p < .01). Measurement number five on the lathe
showed evaluator number 1 to be more sloppy (p < .05). Measurements
three and four on the surface grinder showed evaluator number 1 to be
more sloppy (p < .01). Examination of the standard deviations revealed
no trends which may have indicated that a particular evaluator was more
sloppy than the others.
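As a rough illustration of the variance comparison, the F ratio can be formed from two reported standard deviations. The sketch below uses the vertical mill measurement 5 values from Table 16a (16.073 and 25.637, N = 34 each); the function name is an assumption, and the simple ratio shown may not be the exact variant of the F-test used in the study.

```python
def variance_ratio(sd_a, sd_b):
    """F ratio of two variances, larger variance in the numerator."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)

# Standard deviations for vertical mill measurement 5 (odds), Table 16a
f_ratio = variance_ratio(16.073, 25.637)
df = (34 - 1, 34 - 1)   # degrees of freedom for numerator and denominator
print(round(f_ratio, 3), df)
```

The ratio is then referred to an F table at the chosen significance level.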
Coefficient alphas by judges for the tolerance scores are found
in Table 18. Using Fisher's r to z transformations, the highest
coefficient within each task was compared with the lowest coefficient.
No significant differences were found, nor were any trends found by
examination of the data that would indicate one particular judge as
being sloppier than the other two.
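A minimal sketch of the r to z comparison, using two of the coefficients reported in Table 18 (.63 and .55, each with N = 34). The function name is an assumption, and the standard error shown is the textbook form for two independent coefficients, which may differ from the exact procedure used here.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Compare two coefficients via Fisher's r to z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # r to z
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of z1 - z2
    z = (z1 - z2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation
    return z, p_two_sided

z, p = fisher_z_difference(0.63, 34, 0.55, 34)
print(round(z, 2), round(p, 2))
```

With these inputs the difference is well short of conventional significance, in line with the finding of no significant differences.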
TABLE 18.--Coefficient alpha by judge for tolerance scores.

                Horizontal  Vertical  Drill           Surface
        Judge   Mill        Mill      Press   Lathe   Grinder    N
          1       .68         .76      .63     .23      .89
Odds      6       .67         .74      .55     .46      .91      34
          7       .69         .72      .60     .41      .91
          2       .86         .78      .76     .62      .91
Evens     4       .87         .73      .72     .51      .91      32
          5       .82         .74      .76     .49      .93
Finally, examination of the inter-judge reliabilities
(Tables 9, 10, 11, and 12) reveals no trends that would indicate a
particular judge is sloppier than the two other judges with whom he
was compared. Because no judge was consistently sloppier, no
significance test on the reported correlations was carried out. The
conclusion must therefore be drawn that measurement errors are random
and equal between judges in the long run.
Measurements do not become more reliable over time: The data
presented in Table 19 do not indicate that measurements become more
reliable over time. The only possible exception to this is the two
measurements (3 and 4) on the horizontal mill, where evaluators had to
use sliding parallels to measure the width of the slot; -.27 is
significant at the .05 level and -.24 is significant at the .07 level
(both one-tailed tests).
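The correlations in Table 19 relate the absolute disagreement between two evaluators to the order in which workpieces were evaluated. A minimal sketch with hypothetical scores (not data from this study) follows; the variable names are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

order = [1, 2, 3, 4, 5, 6]          # order of evaluation (time)
score_one = [2, 1, 0, 2, 1, 1]      # hypothetical scores, evaluator one
score_two = [0, 0, 0, 1, 1, 1]      # hypothetical scores, evaluator two
abs_dev = [abs(a - b) for a, b in zip(score_one, score_two)]
r = pearson_r(order, abs_dev)
print(round(r, 3))
```

A negative r, as in this made-up example, would mean the two evaluators disagreed less on later workpieces.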
Internal Consistency Reliability
Internal consistency reliability is shown in Table 20. Within
each machine by tolerance and by finish, internal consistency is fairly
high. The only deviation from this is the reliability of the lathe
tolerance measurements. This finding can easily be explained by the
lack of inter-judge reliability on this machine, which has a direct
bearing on the extent of internal consistency reliability. Overall
TABLE 19.--Correlations of absolute deviations of scored measurements[a]
with time.[b]

Dimension   Horizontal   Vertical   Drill             Surface
            Mill         Mill       Press    Lathe    Grinder
    1          .16         .14      -.22     -.03       .12
    2         -.03         .11      -.09      .04       .17
    3         -.27**       .13      -.10      .06       .10
    4         -.24*        .25       .00      .09       .09
    5         -.08         .13      -.04     -.15
    6         -.15         .13      -.11
    7                     -.02

[a] Absolute deviations of scored measurement = measurement of evaluator
one minus measurement of evaluator two.
[b] Negative values indicate less slop with time.
Positive values indicate more slop with time.
*p < .10
**p < .05
TABLE 20.--Internal-consistency reliability.

                                            Tolerance & Finish
Machine                Tolerance   Finish   2 item[a]   Multi-item[b]
Horizontal Mill           .76        .78       .44          .82
Vertical Mill             .76        .84       .49          .84
Drill Press               .65        .69       .55          .76
Lathe                     .45        .63       .49          .65
Surface Grinder           .91        .77       .27          .86
Total (by machine)        .56        .59
Total (by dimension)      .76        .66

Total--2 items: total tolerance; total finish                      .74
Total--5 items: (total tolerance + finish) X 5 machines            .70
Total--10 items: (total tolerance) X 5; (total finish) X 5         .73
Total--38 items: by tolerance dimensions (28); by finish
dimensions (10)                                                    .82

[a] Two items are finish score and tolerance score.
[b] Items are each measurement and evaluation within tolerance and
finish.
reliability (of the total test) is also high. The actual internal
consistency reliability of the total test lies somewhere between .70,
which is the reliability of a five item test where each machine (toler-
ance plus finish) is an item, and .82, which is the reliability of a
38 item test where each tolerance measurement and each finish evalua-
tion are items. An inflated estimate of reliability is .82 because of
correlated errors within each task. The two item test and the ten item
test have a smaller degree of correlated errors, and the five item test
has no correlated errors. An underestimate is .70 because the test was
actually longer than five items.
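For readers unfamiliar with the statistic, coefficient alpha can be computed from item scores as sketched below. The data are hypothetical (three items, five examinees) and the function name is an assumption; the sketch only shows how the number of items and the item variances enter the formula.

```python
import statistics

def coefficient_alpha(items):
    """Coefficient alpha; `items` is a list of k lists, one score per examinee."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per examinee
    sum_item_var = sum(statistics.variance(col) for col in items)
    return (k / (k - 1)) * (1 - sum_item_var / statistics.variance(totals))

# Hypothetical scores: 3 items (rows) by 5 examinees (columns)
items = [
    [2, 1, 0, 2, 1],
    [2, 1, 1, 2, 0],
    [1, 1, 0, 2, 0],
]
alpha = coefficient_alpha(items)
print(round(alpha, 3))
```

Treating each machine as one item (k = 5) or each dimension as one item (k = 38) changes both k and the item variances, which is why the two estimates in the text differ.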
Correlation Matrix
Data in the correlation matrix in Table 21 can be used to
assess the relationship of performance on one machine to the next and
relationship of tolerance to finish. The overall correlation between
tolerance and finish was .59. Correlations between machines on toler-
ance and on finish are very similar. The surface grinder had the
smallest relationship with the other machines. This finding may change
if a surface grinder were to be used that was in better condition than
the one used in this study.
TABLE 21.--Intercorrelations of scores within the performance test.
[Correlation matrix not legible in the source scan. Row and column variables:]
A. Performance Test: 1) Total tolerance plus finish score; 2) Total
tolerance score; 3) Total finish score
B. Tolerance Plus Finish Scores for: 1) Horizontal Mill; 2) Vertical
Mill; 3) Drill Press; 4) Lathe; 5) Surface Grinder
C. Tolerance Scores for: 1) Horizontal Mill; 2) Vertical Mill; 3) Drill
Press; 4) Lathe; 5) Surface Grinder
D. Finish Scores for: 1) Horizontal Mill; 2) Vertical Mill; 3) Drill
Press; 4) Lathe; 5) Surface Grinder
Test-Retest Reliability
Stability across time, as can be seen in Table 22, is virtually
nonexistent. No correlations between time one and time two were very
high, and the overall picture is one of no test-retest reliability for
the time interval in the present study. However, as pointed out earlier,
this time interval was far too long to draw any conclusions about
whether or not this performance test does in fact have test-retest
reliability.
TABLE 22a.--Test-retest reliability by machine.
Machine Tolerance Finish
Horizontal Mill _ .04 .34
Vertical Mill .47 -.10
Drill Press .41 .00
Lathe -.05 .16
Surface Grinder .23 .15
Total -.07 .33
56
TABLE 22b.--Test-retest reliability by dimension.
Dimension Tolerance Finish
Horizontal Mill .32
-.40
.26
.07
.00
Vertical Mill
Drill Press
Surface Grinder
DISCUSSION
Evaluating the Evaluators
The results of this study bear out the hypotheses that non-
tradesmen can reliably evaluate (measure and judge for finish) the end
products of a test for machinists in the metal trades. The fact that
the questions concerning sloppiness and biases on the part of evalu—
ators were not affirmed by the data lends further support to the use
of non-tradesmen in evaluating this kind of performance test. Those
measurements which did not have inter-judge reliabilities as high as
most of the others, and the measurers who did not have reliabilities
as high as the others, point out the need for training of and feedback
to the evaluators. The fact that the measurers of the odd-numbered
workpieces did not have as high agreement as those of the even-numbered
workpieces could possibly be explained by differences in the workpieces
themselves--the odd-numbered ones just by chance were harder to measure.
But another likely explanation is that these evaluators were not
sufficiently trained. The process of training involves showing the
evaluators how to use and read the instruments, showing what part of
the workpiece to measure, and placing a psychological emphasis on such
factors as being careful or meticulous in making measurements and
measuring in a standard or consistent manner. Only a small amount of
effort was put into the process of training the evaluators to evaluate
as accurately as possible. Had more of an effort been put into
training the evaluators, inter-judge reliability may have been higher,
and equally high for all measurers and measurements. However, there was
no way of empirically showing from the data in this study that an
insufficient training effort was a major cause of unreliability in the
measurements.
The data, in general, did not indicate that evaluators became
more reliable over time. The probable explanation for this is also
related to training. Evans (1951) reported that feedback to evaluators
about the accuracy of their measurements is essential if inter-judge
agreement is to remain high. Evans found that raters improve in sets of
measurements in immediate succession (e.g., set 1 has the least
accurate measurements; set 2 has intermediate accuracy; set 3 has the
most accurate measurements). Feedback came immediately after each set.
However, when there was a long interval (more than one hour) between
sets of measurements, accuracy went down, and when there was a very
long interval (10 days), accuracy was worse than in the beginning
(immediately following an initial 30 minute training period). In light
of Evans' findings, it is not surprising that evaluators did not become
more reliable over time. Little or no feedback was given to evaluators,
and measurements were not all done in immediate succession. Any
practice effect that may have been operating was probably offset by
these two factors, resulting in no negative correlations of absolute
deviations of evaluators with time.
Most of the same factors that operate on tolerance evaluations
also operate on the finish judgments, with one important addition being
an element of subjectivity. Finish evaluations were not measured, but
instead were subjective judgments on how smooth the workpiece was. The
process was made less subjective by introducing the system of bench-
marks, which enabled evaluators to make relative, rather than absolute,
judgments. The evaluator's major difficulties, and probably the major
source of disagreement, came when a workpiece fit between two cate-
gories. One way to make finish evaluations more reliable would be to
have more benchmarks within each category. The evaluator would thus be
more likely to find a benchmark that would be closer in finish to the
workpiece he or she is evaluating. One final explanation of why finish
reliability was not as high as tolerance reliability is that finish was
evaluated long after tolerance, allowing some workpieces to become
rusty.
The results of this study pertaining to inter-judge reliability
are similar to those reported in the literature. Inter-judge relia-
bility in this study appears to be about the same as that found by
Bornstein et al. (1957) as shown in Table 9. Siegel (1954a) and
Stuit (1947) report inter-judge reliabilities in the .90's. While most
of the reliability coefficients in the present study reached the .90's,
there were many exceptions to this. Kelly (1955) reported on
subjective kinds of judgments, very similar to the finish scores in
this study. However, she used a method of paired-comparisons of 10 glass
panels which resulted in rank-orderings of the panels. This was re-
peated twice for each judge and rankings within each observer were
correlated. Therefore, there is no way of directly comparing Kelly's
results with the present study. Kelly's conclusion was that these
judgments can be made reliably, a conclusion which is in agreement with
the results of this study.
Evaluating the Test
Coefficient alphas for tolerances by machine and for finish by
machine were at least adequate, with the exception of the lathe. The
low internal consistency reliability of the lathe was explained in the
results section by the fact that inter-judge reliability was low. The
surface grinder had the highest tolerance reliability. This would be
expected because the surface grinder task was by far the least complex,
and also had the greatest degree of correlated errors. No layout was
necessary on this task and only one grinding process was called for.
Very high coefficient alphas were not expected, nor were they desirable,
across tolerance and finish or across machines. Had alphas been high,
it would have been an indication that the performance test was only
tapping one dimension. Tolerance and finish, and ability on each
machine, were expected to be related to some extent but not extremely
highly related. So alphas were not expected to be very high. As can
be seen in Table 21, correlations between machines (tolerance plus
finish) and correlations between tolerance and finish were moderately
positive (significant in most cases), but not extremely high. The
pattern of correlation did not differ much from those reported by
Campion (1972), reported in Table 1, or by Siegel and Jensen (1955),
reported in Table 5, but were generally higher than those reported by
Bornstein et al. (1957), reported in Table 2.
Overall reliability of the performance test (total including
tolerance and finish on all five machines) is estimated to be about
.76. It would be somewhere between alpha of a five item test (.70)
where total score on each machine is an item, and a 38 item test (.82)
where all tolerance and finish dimensions are items. The former is
an underestimate because the test is in fact longer than five items,
and the latter is an overestimate because the errors made in many of
the 38 items are correlated with errors made in other items. Alpha of
the performance test used in this study is slightly higher than the
alphas calculated from the data in the studies of Campion (1972),
Bornstein et al. (1957), and Siegel and Johnson (1955), which were
.40, .62, and .59, respectively. Here, the appropriate comparison
would be with .70, the alpha of a five item test.
Evaluating the Test (The
Second Time Around)
No evidence of test-retest reliability was found in this study.
Low test—retest reliability can often be explained by restriction in
range. In this study, however, the lack of stability cannot be ex—
plained by a restriction in range problem. The 21 retest subjects did
not differ in variance from the original 68 subjects on the first
testing, nor was the variance restricted on the second testing, which
may have been expected as a result of practice that would create a
ceiling effect. Due to constraints placed on the researchers by the
participating company, and also due to practical considerations by the
researchers, the time period between test and retest was far too long
for all testees. The range was from 7 to 20 weeks, averaging about
14 weeks. The ideal period between test administrations should have
been one to two weeks. Any longer period of time would result in
differential learning and practice on the different machines. If a
person has not had any practice on a machine for three or four months,
then he would undoubtedly be rusty on that machine. If another person
had had many hours of experience on a particular machine, then his
score would be expected to improve. The training program for the
apprentices in the retest company lacks any kind of uniformity across
apprentices, so some of the subjects had no experience, while others
had experience on one or more of the machines in the interim. (This
lack of uniformity was confirmed through examination of machine time
logs of the various retest subjects for the interim period.) With
these conditions, this performance test cannot be expected to have
high stability over long periods of time.
Conclusions
The performance test in this study was one constructed to be
reliable across judges, to be internally consistent, and to be stable
over time. The data showed the test to be reliable across judges and
to be internally consistent, but not stable over the extreme time in-
terval that was employed. Further research needs to be conducted to
examine retest reliability. The interim period should be from one to
two weeks if one is to expect a fair degree of stability. A larger
sample would also be a desirable component of any further research on
retest reliability. All subjects who were tested the first time,
rather than only volunteers, should be tested the second time.
While inter-judge agreement was high in almost all respects,
careful training of evaluators and a feedback system should be neces-
sary elements in the use of performance tests. More extensive training
is especially important where judgments are somewhat subjective in
nature. (In this study it was the benchmark system for evaluating
finish.)
The generalization of results of this study to performance
testing in occupations other than the metal trades is speculative and
in need of further research. However, it appears that a test where
end products are what are being evaluated can be scored reliably and
can be constructed to yield adequate reliability.
A number of implications can be drawn from this study which
pertain to the use of performance testing by government, industry, or
vocational counseling. First, this study takes a thorough look at
reliability issues involved in the construction of a performance test--
more thorough than past studies found in the literature. Future re-
search should contain in it all elements of reliabilities that are con-
tained in this study. The results of this study lend support to the
feasibility of using performance tests--non-tradesmen were able to score
the test reliably at their own convenience. The use of performance
testing by government in occupational licensing or in selection and
placement of job applicants can only be justified if such performance
tests are reliable. This study lends support to use in these areas--
this kind of test can be constructed to be internally consistent and
to be scored reliably. Use of this kind of test by governmental
agencies may result in improvements in their own functions and
effectiveness. Industry can use such a test to evaluate training pro-
grams or to evaluate individual progress (provided that retest relia-
bility is established). The fact that reliable performance tests can
be constructed lends support to the performance testing movement, which
argues that tests which sample job skills are often more valid and
fairer to minorities than traditional paper-and-pencil tests.
APPENDICES
APPENDIX A
TRANSCRIPTS OF TAPED INSTRUCTIONS AND
BLUEPRINTS OF TASKS
69
Lathe Instructions
These are the instructions for the Lathe Task 2. Please listen
to these instructions in their entirety before starting this task. The
instructions are as follows:
Examine the workpiece drawing. It will be labeled "Lathe
Task 2." You are to face the workpiece to the length cited. You are
also to bore it out to the dimension shown. And finally, you are to
chamfer one end of the workpiece both internally and externally as in-
dicated. Do your best to stay within the tolerances shown.
Your performance in this task will be evaluated on two dimen-
sions: quality and speed. You should therefore work as quickly as you
feel is possible to turn out high quality work. You will have only one
workpiece.
When you finish, place your identification sticker on the work-
piece and place the workpiece in the bin provided. Press the sticker
on carefully, since it sometimes does not stick well.
Next, clean up the machine, leaving it in the condition you
found it for the next person.
This completes the instructions. If you need to, you may re-
wind the tape and listen to any portion of the instructions again.
When you are completely finished, rewind the tape and turn off
the tape recorder. Then report to the administrator for assignment to
another machine.
This is the end of the tape.
[Blueprint: HORIZONTAL AND VERTICAL MILL TASKS]
[Blueprint: drill press task, showing a .750 dia. hole with an 82° countersink]
[Further blueprint; drawing content not recoverable from the scan]
APPENDIX B
TOLERANCE EVALUATION SHEETS,
FINISH EVALUATIONS, AND
EXPLANATION OF DIMENSION NUMBERS
Testee
Evaluator
Horizontal Mill Task
Tolerance Evaluations
Instructions: Using the appropriate instruments, measure and record
each of the workpiece dimensions specified on this sheet. Then place
a circle around the most stringent of the below listed tolerances which
that dimension meets.
Location of Slot measured at end "L" (use calipers)
1" ± .002    1" ± .004    Neither
Location of Slot measured at end "R" (use calipers)
1" ± .002    1" ± .004    Neither
Width of Slot measured at end "L" (use sliding parallels)
(Do not rest sliding parallels on bottom of slot)
.625" +.002/-.000    .625" +.004/-.002    Neither
Width of Slot measured at end "R" (use sliding parallels)
.625" +.002/-.000    .625" +.004/-.002    Neither
Thickness of unmilled portion of workpiece at end "L" (use
micrometer) (A)
Thickness of milled portion of workpiece at end "L" (use
micrometer) (B)
Depth of Slot at end "L" (A-B)
.125" ± .002    .125" ± .004    Neither
Thickness of unmilled portion of workpiece at end "R" (use
micrometer) (C)
Thickness of milled portion of workpiece at end "R" (use
micrometer) (D)
Depth of Slot at end "R" (C-D)
.125" ± .002    .125" ± .004    Neither
Is the Slot in the correct location? Yes No
Was the cut started at another location on the workpiece? Yes No
Was a deeper or wider cut started but not completed? No Deeper Wider
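The instruction to circle "the most stringent of the below listed tolerances which that dimension meets" amounts to a simple classification. The sketch below is illustrative only: the function name, the labels, and the small rounding allowance are assumptions, and the sheet does not state how the circled category was converted to a numeric score.

```python
def tolerance_category(measured, nominal, tight, loose, eps=1e-9):
    """Return the most stringent tolerance band the measurement meets.

    `tight` and `loose` are (minus, plus) allowances around `nominal`;
    `eps` guards against floating-point rounding at the band edges.
    """
    for label, (minus, plus) in (("tight", tight), ("loose", loose)):
        if nominal - minus - eps <= measured <= nominal + plus + eps:
            return label
    return "neither"

# Slot width bands as listed on the sheet: .625 +.002/-.000 and .625 +.004/-.002
tight, loose = (0.000, 0.002), (0.002, 0.004)
print(tolerance_category(0.6265, 0.625, tight, loose))  # tight
print(tolerance_category(0.6240, 0.625, tight, loose))  # loose
print(tolerance_category(0.6300, 0.625, tight, loose))  # neither
```

Checking the tighter band first guarantees that a measurement meeting both bands is assigned the more stringent one.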
77
Testee
Evaluator
Vertical Mill Task-~Milling a Pocket
Tolerance Evaluations
Instructions: Using the appropriate instruments, measure and record
each of the workpiece dimensions specified on this sheet. Then place
a circle around the most stringent of the below listed tolerances which
that dimension meets.
Thickness of unmilled workpiece near left end of pocket (use
micrometer) (A)
Thickness of stock at left end of pocket (use micrometer) (B)
Depth of pocket at left end (A-B)
.250" +.002/-.000    .250" +.004/-.000    Neither
Thickness of unmilled workpiece near right end of pocket
(use micrometer) (C)
Thickness of stock at right end of pocket (use micrometer) (D)
Depth of pocket at right end (C-D)
.250" +.002/-.000    .250" +.004/-.000    Neither
Length of pocket (use calipers)
1" ± .005    1" ± .010    Neither
For width of pocket take width of unmilled stock minus width of
milled part (use calipers).
Width of pocket at end "L"
.625 ± .005    .625 ± .010    Neither
Width of pocket at middle
.625 ± .005    .625 ± .010    Neither
Width of pocket at end "R"
.625 ± .005    .625 ± .010    Neither
Location of pocket from end "L" (use calipers)
1.5" ± .005    1.5" ± .010    Neither
Radius of cutter that was used (use shank of 3/8" end mill cutter)
3/16"    Something else
Is the pocket in the correct location? Yes No
Was an incorrect cut started at another location on the workpiece?
Yes No
Was a wider or longer cut started in the pocket but not completed?
Yes No
79
Testee
Evaluator
Drill Press Task
Tolerance Evaluations
Instructions: Using the appropriate instruments, measure and record
each of the workpiece dimensions specified on this sheet. Then place
a circle around the most stringent of the below listed tolerances which
that dimension meets.
Hole Diameter (use telescoping gauge and "mike") (A)
.750" (+ limit illegible, - .001)    Meets a Less Stringent Tolerance
Distance from edge of hole to end "L" as measured from top of
workpiece (use calipers, inserting them from the countersink side of
the workpiece) (B)
Distance from edge of hole to end "L" as measured from bottom of
workpiece (use calipers, inserting them from the side of the
workpiece that has not been countersunk) (C)
Distance from center of hole to end "L" as measured from top of
workpiece (B + 1/2 A)
1.750" ± .005    1.750" ± .010    Neither
Distance from center of hole to end "L" as measured from bottom of
workpiece (C + 1/2 A)
1.750" ± .005    1.750" ± .010    Neither
Distance from edge of hole to end "C" as measured from top of
workpiece (D)
Distance from edge of hole to end "C" as measured from bottom of
workpiece (E)
Distance from center of hole to end "C" as measured from top of
workpiece (D + 1/2 A)
1" ± .005    1" ± .010    Neither
Distance from center of hole to end "C" as measured from bottom of
workpiece (E + 1/2 A)
1" ± .005    1" ± .010    Neither
Countersink Dimension (convert to decimal)
1" ± 1/32    Too Narrow    Too Deep
Was an 82° Countersink used? (Insert 82° Countersink in hole) Yes No
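The derived center distances on this sheet (for example B + 1/2 A) are edge distances plus half the measured hole diameter. A short worked example with hypothetical readings, assuming the 1.750" ± .005 band listed above:

```python
def center_distance(edge_distance, hole_diameter):
    """Distance from the reference end to the hole center."""
    return edge_distance + hole_diameter / 2.0

a = 0.750   # hypothetical measured hole diameter (A), inches
b = 1.376   # hypothetical edge-to-end distance from the top side (B), inches
center = center_distance(b, a)
within_tight = abs(center - 1.750) <= 0.005
print(round(center, 3), within_tight)
```

Because the center distance depends on A, an error in the hole diameter reading carries over at half its size into every derived distance.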
81
Testee
Evaluator
Lathe Task II--Boring, Facing, and Chamfering
Tolerance Evaluations
Instructions: Using the appropriate instruments, measure and record
each of the workpiece dimensions specified on this sheet. Then place
a circle around the most stringent of the below listed tolerances which
that dimension meets.
Length of Workpiece: ______ ______ ______ Avg. ______
(Using the calipers, make several measurements, rotating the workpiece
approximately 120° after each, and record them above. Then compute the
average of these readings and record it. Use this average reading for
purposes of specifying the most stringent of the below listed length
tolerances which the workpiece meets. When making the individual
caliper measurements be sure the calipers are placed close to the edge
of the workpiece and away from the center axis point.)
2" ± .001    2" ± .002    Neither
Diameter of Bored Hole at end "L" (use telescoping gauge)
.800 +.001/-.000    .800 +.002/-.001    Neither
Diameter of Bored Hole at end "R" (use telescoping gauge)
.800 +.001/-.000    .800 +.002/-.001    Neither
Inside Chamfer (use scale)
1/16" ± 1/64    Meets a Less Stringent Tolerance
Outside Chamfer (use scale) (convert to decimal)
1/16" ± 1/64    Meets a Less Stringent Tolerance
Are the inside and outside chamfers on the same end of the workpiece?
Yes No
82
Testee
Evaluator
Surface Grinder Task
Tolerance Evaluations
Instructions: Using a dial micrometer, measure and record each of the
workpiece dimensions specified on this sheet. Then place a circle
around the most stringent of the below listed tolerances which that
dimension meets.
Thickness of Workpiece as measured at Corner "A"
.719" ± .0002    .719" ± .0004    Neither
Thickness of Workpiece as measured at Corner "B"
.719" ± .0002    .719" ± .0004    Neither
Thickness of Workpiece as measured at Corner "C"
.719" ± .0002    .719" ± .0004    Neither
Thickness of Workpiece as measured at Corner "D"
.719" ± .0002    .719" ± .0004    Neither
Were both the top and bottom sides of the workpiece ground? Yes No
If not, what was ground?
Note: In making these measurements be sure to place the micrometer in
far enough to avoid burrs on the edges of the workpiece.
Testee
Evaluator
Finish Evaluations
Instructions: For each finish evaluation, place the numbered work-
pieces to be used as benchmarks on the table in front of you. Be sure
to include all of the numbered workpieces in each benchmark category.
Identify the benchmark category most closely represented by the work-
piece you are evaluating. Write the number of this benchmark category
in the space provided.
HORIZONTAL AND VERTICAL MILL TASKS
Finish of Pocket Floor:
Benchmark Benchmark Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3 Category #4 Category #5
w.p. #48 w.p. #45 w.p. #46 w.p. #42 w.p. #39
Finish of Sides of Pocket:
Benchmark Benchmark Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3 Category #4 Category #5
w.p. #26 w.p. #46 w.p. #41 w.p. #49 w.p. #32
Finish of Slot Floor:
Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3
w.p. #39 w.p. #28 w.p. #34
w.p. #48 w.p. #36 w.p. #35
Testee
Evaluator
Finish Evaluations
Instructions: For each finish evaluation, place the numbered work-
pieces to be used as benchmarks on the table in front of you. Be sure
to include all of the numbered workpieces in each benchmark category.
Identify the benchmark category most closely represented by the work-
piece you are evaluating. Write the number of this benchmark category
in the space provided.
DRILL PRESS TASK
Countersink Finish:
Benchmark Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3 Category #4
w.p. #28 w.p. #34 w.p. #42 w.p. #36
w.p. #44 w.p. #72 w.p. #54 w.p. #49
w.p. #47 w.p. #62
w.p. #51 w.p. #63
Finish of Hole: (Note: If a ridge or line appears on the hole wall,
indicate the next lowest benchmark category.)
Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3
w.p. #36 w.p. #58 w.p. #42
w.p. #70 w.p. #60
Testee
Evaluator
Finish Evaluations
Instructions: For each finish evaluation, place the numbered work-
pieces to be used as benchmarks on the table in front of you. Be sure
to include all of the numbered workpieces in each benchmark category.
Identify the benchmark category most closely represented by the work-
piece you are evaluating. Write the number of this benchmark category
in the space provided.
LATHE TASK
Finish of bored hole: (Note: This is evaluated by both touch and
sight.)
Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3
w.p. #11 w.p. #58 w.p. #10
w.p. #12
Finish of nonchamfered end:
Benchmark Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3 Category #4
end not faced w.p. #3 w.p. #58 w.p. #9
w.p. #11 w.p. #59 w.p. #56
Finish of chamfer:
Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3
w.p. #11 w.p. #3 w.p. #7
w.p. #12 w.p. #8
Testee
Evaluator
Finish Evaluations
Instructions: For each finish evaluation, place the numbered work-
pieces to be used as benchmarks on the table in front of you. Be sure
to include all of the numbered workpieces in each benchmark category.
Identify the benchmark category most closely represented by the work-
piece you are evaluating. Write the number of this benchmark category
in the space provided.
SURFACE GRINDER TASK
Finish of nonlabeled side: (Note: This is evaluated by running your
fingernail widthwise across the workpiece.)
Benchmark Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3 Category #4
w.p. #29 w.p. #28 w.p. #48 w.p. #35
w.p. #37
Chatter on nonlabeled side: (Note: This is evaluated by sight. Tilt
workpiece so that it reflects light. Look
for the extent to which wavy lines appear.)
Benchmark Benchmark Benchmark
Category #1 Category #2 Category #3
w.p. #29 w.p. #43 w.p. #37
w.p. #53
Explanation of Dimension Numbers

                 Dimension
Task             No.        Explanation
Horizontal Mill  1          Location of slot from edge of workpiece to
                            edge of slot at end "L" using calipers to
                            make measurement
Horizontal Mill  2          Location of slot at end "R"
Horizontal Mill  3          Width of slot at end "L" measured by
                            inserting sliding parallels in the slot and
                            measuring sliding parallels with the
                            micrometer
Horizontal Mill  4          Width of slot at end "R"
Horizontal Mill  5          Depth of slot at end "L" using micrometer
Horizontal Mill  6          Depth of slot at end "R"
Vertical Mill    1          Depth of pocket at left end using micrometer
Vertical Mill    2          Depth of pocket at right end
Vertical Mill    3          Length of pocket using inside part of
                            calipers
Vertical Mill    4          Width of pocket at end "L" using calipers
Vertical Mill    5          Width of pocket at middle
Vertical Mill    6          Width of pocket at end "R"
Vertical Mill    7          Location of pocket (edge of pocket from
                            side of workpiece) at end "L"
Drill Press      1          Hole diameter using telescoping gauge and
                            measuring it with a micrometer
Drill Press      2          Distance from center of hole to end "L"
                            (side of workpiece) using calipers on top
                            of workpiece (countersunk side)
Drill Press      3          Distance from center of hole to end "L"
                            measured from bottom of workpiece
Drill Press      4          Distance from center of hole to end "C"
                            using calipers on top of workpiece
Drill Press      5          Distance from center of hole to end "C"
                            measured from bottom of workpiece
Drill Press      6          Countersink dimension (diameter) measured
                            with a scale
Lathe            1          Length of workpiece (average of three
                            measurements) using calipers
Lathe            2          Diameter of hole at end "L" using
                            telescoping gauge and micrometer
Lathe            3          Diameter of hole at end "R"
Lathe            4          Inside chamfer measured with a scale
Lathe            5          Outside chamfer
Surface Grinder  1          Thickness of workpiece measured as close
                            to corner "A" as possible without
                            overlapping corner in order to avoid
                            measuring burrs
Surface Grinder  2          Thickness of workpiece at corner "B"
Surface Grinder  3          Thickness of workpiece at corner "C"
Surface Grinder  4          Thickness of workpiece at corner "D"
The letters in quotation marks ("L," "R," "C," "A," "B," "D") refer to
labels placed on the workpieces to standardize measurements (in the case
of the surface grinder) or to make the measuring process easier and to
avoid making errors in measuring the wrong parts of the workpieces.
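
The lathe length dimension is an average of three caliper readings, and the evaluator then marks the most stringent tolerance that the average meets. A minimal Python sketch of that bookkeeping follows. The function name, the sample readings, and the small floating-point allowance are illustrative assumptions; the 2" ± .001 and 2" ± .002 tolerances are the ones printed on the lathe evaluation sheet.

```python
from statistics import mean

def most_stringent_tolerance(readings, nominal, tolerances):
    """Average the readings and return (average, tightest tolerance met).

    Returns None as the tolerance when the average meets none of them
    (the "Neither" choice on the sheet). Hypothetical helper, not part
    of the original procedure.
    """
    average = mean(readings)
    deviation = abs(average - nominal)
    for tol in sorted(tolerances):          # tightest tolerance first
        if deviation <= tol + 1e-9:         # allowance for float rounding
            return average, tol
    return average, None

# Three made-up caliper readings taken about 120 degrees apart.
average, met = most_stringent_tolerance([2.0004, 2.0012, 2.0008], 2.0, [0.001, 0.002])
print(round(average, 4), met)
```

Sorting the tolerances before the loop means the first match is always the tightest one, which mirrors the instruction to circle the most stringent tolerance the workpiece meets.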
APPENDIX C
EXPLANATION OF SCORING SYSTEM
AND TWO POINT SYSTEM
EXPLANATION OF SCORING SYSTEM
Tolerances were set with the help of the project's machinist
consultant. As can be seen on the evaluation sheets in Appendix B,
there were two tolerances for most measurements--the first tolerance,
which is identical to that shown on the blueprints, and a second
tolerance which is not as stringent as the first. Dimensions were to be
scored 2, 1, or 0, depending on whether the testee's measurement fell
within the first tolerance, within the second tolerance, or outside the
second. It was recognized before subjects were tested that the
distributions within each preset tolerance might not be ideal, and that
a revision of the tolerances based on the distributions of the actual
measurements would be necessary. These revisions were made, taking both
the original tolerances and the distributions into consideration. The
revised scoring system appears on the following pages. Each table lists
the number of measurements to be scored for a particular dimension, the
dimension (corresponding to the dimension on the evaluation sheets),
and, under the number of points to be assigned, the tolerance that
earns those points. The approximate percentage of testees falling into
each tolerance category is given in parentheses.
In the analyses reported in this study, the yes-no questions
(found at the end of each evaluation sheet in Appendix B) were not
scored. There was little or no variance in these items--very few
testees used the wrong tools or started machining on the wrong part
of the workpiece. Furthermore, no weighting system could be decided
on for these items. Because they added almost nothing to the test,
they were dropped from the analysis.
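
The scoring rule described above (2 points inside the first tolerance, 1 point inside the second, 0 otherwise) and the percentage of testees in each category can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the symmetric tolerances of .002 and .008, and the list of deviations are hypothetical and are not data from the study.

```python
from collections import Counter

def score_deviation(deviation, first_tol, second_tol):
    """Score one measurement from its deviation from the nominal dimension.

    Assumes symmetric (plus-or-minus) tolerances; names are hypothetical.
    """
    size = abs(deviation)
    if size <= first_tol + 1e-9:    # allowance for float rounding
        return 2
    if size <= second_tol + 1e-9:
        return 1
    return 0

def category_percentages(deviations, first_tol, second_tol):
    """Percentage of measurements earning 2, 1, and 0 points."""
    scores = [score_deviation(d, first_tol, second_tol) for d in deviations]
    counts = Counter(scores)
    return {pts: round(100 * counts[pts] / len(scores), 1) for pts in (2, 1, 0)}

# Made-up deviations (in inches) for five testees on one dimension.
deviations = [0.001, -0.002, 0.005, 0.009, 0.0]
print(category_percentages(deviations, first_tol=0.002, second_tol=0.008))
```

Tabulating the percentages this way shows how unevenly testees spread across the point categories, which is the kind of information the tolerance revision drew on.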
#_of
meas.
NFPP—
2POINT
SYSTEM
Horizontal
Mill
dimension
2points
location
1.000
+1
.002
(34.8)
.002
.000
.002
(33.3)
width
.625
+1
(25.8)
+1
depth
.125
correct
location
another
cut
deeper
orwider
Vertical
Mill
.002
..000
length
1"
.005
(39.7)
width
.625
location
1.5
.005
(48.5)
radius
3/16"
correct
location
depth
.250
+1
(29.4)
+1 +1
incorrect
cut
started
wider
or
longer
cut
no
1point
+1 +1 +1 +1 +1 +1 +1
.008
(17.9)
.006
(40.8)
.004
(33.4)
yes
no
110
.005
.002
.011
(30.9)
.008
(45%?)
.010
(14.7)
(32.4)
yes
110
0points
other
(47.3)
(33.4)
(33.3)
yes
yes
(38.2)
(29.4)
(55%?)
(26.8)
no
yes
yes
#of
meas.
NNF—l—
dimension
diameter
distance
to
"L"
distance
to
"C"
c/s
82°
length
diameter
chamfer
same
end
2POINT
SYSTEM
(cont.)
Drill
Press
2points
.010
.001
1.750
.005
(25)
1.000
.005
(20.6)
1.000
.032
+1
.750
(60.3)
+1 +1
Lathe
.001
(46.1)
000]
.000
.06251:.0157
(66.1)
2”
+1
.8"
+1
(38.5)
‘IID
1point
+1
.015
(25)
.015
(20)
-.064
+1 +1
.003
(14.4)
.004
.003
+1
(23.0)
yes
0points
other
(39.7)
(50.0)
(59.4)
too
narrowor
deep
no
(28.5)
(29.5)
other
(23.9)
110
2 POINT SYSTEM (cont.)

Surface Grinder

# of
meas.  dimension              3 points        2 points        1 point         0 points
4      thickness .719         ± .0002 (14.7)  ± .0004 (16.2)  ± .0006 (23.5)  (45.6)
1      top and bottom ground  yes / no
REFERENCES
Adkins, D. C., Primoff, E. S., McAdoo, H. L., Bridges, C. F., and
Forer, B. Construction and analysis of achievement tests.
Washington, D.C.: U.S. Government Printing Office, 1947.
Besnard, G. G., and Briggs, L. J. Measuring job proficiency by means
of a performance test. In E. A. Fleishman (Ed.), Studies in
personnel and industrial psychology. Homewood, Ill.: Dorsey,
1967.
Blood, M. R. Job samples: A better approach to selection. American
Psychologist, 1974, 29, 218-219.
Bornstein, H., Jensen, B. T., and Dunn, T. F. The reliability of
scoring in performance testing as a function of tangibility of
the performance product. Abstract of paper read at the 1954
APA Convention. American Psychologist, 1954, 9, 336-337.
Bornstein, H., Jensen, B. T., Goldstein, L. G., and Dunn, T. F.
Technical research note 75: Evaluation of the basic military
performance test. Washington, D.C.: Department of the Army, The
Adjutant General's Office, Personnel Research and Procedures
Division, June, 1957.
Boyd, J. L., Jr., and Shimberg, B. Handbook of performance testing.
Princeton, N.J.: Educational Testing Service, 1971.
Campion, J. E. Work sampling for personnel selection. Journal of
Applied Psychology, 1972, 56, 40-44.
Cole, N. S. The right question, wrong answers. American Psychologist,
1974, 29, 219-220.
Dunn, T. F., Bornstein, H., Jensen, B. T., and Tye, V. M. A group
administered performance test of Army basic skills. Abstract
of paper read at 1954 APA Convention. American Psychologist,
1954, 9, 357.
Equal Employment Opportunity Commission. Guidelines on Employment
Selection Procedures. Washington, D.C., 1970.
Evans, R. N. Training improves micrometer accuracy. Personnel Psy-
chology, 1951, 4, 231-242.
Frederiksen, N. Factors in in-basket performance. Psychological
Monographs: General and Applied, 1962, 76 (22) (Whole No. 541).
Gael, S. On O'Leary's "Fair employment . . . ." American Psychologist,
1974, 29, 216-217.
Gordon, J. E. Testing, counseling and supportive services for disad-
vantaged youth. Ann Arbor: Institute of Labor and Industrial
Relations, University of Michigan, 1969, 54—62.
Griggs, et al., vs. Duke Power Company. Supreme Court of the United
States, No. 124, October Term, 1970 (March 8, 1971).
Havron, D. M., Lybrand, W. A., and Cohen, E. The assessment of
infantry rifle squad effectiveness. U.S. Army Personnel Research
Branch, The Adjutant General's Office, Technical Research
Report 1087, December, 1954.
Jewish Employment and Vocational Service. Work Samples: Signposts on
the road to occupational choice. Final Report to Manpower
Administration, U.S. Department of Labor. Experimental Demon-
stration Contract No. 82-40-67-40, September 30, 1968.
Jewish Employment and Vocational Service. Job Trials for Personnel
Selection. Final Report to Manpower Administration. U.S.
Department of Labor, Contract No. 82-42-72-08, March 15, 1973.
Kelly, M. L. A study of industrial inspection by the method of paired
comparisons. Psychological Monographs: General and Applied,
1955, 69 (9) (Whole No. 394).
Lawshe, C. H., Jr., and Tiffin, J. The accuracy of precision
instrument measurement in industrial inspection. Journal of Applied
Psychology, 1945, 29, 413-419.
McClelland, D. C. Testing for competence rather than for
"intelligence." American Psychologist, 1973, 28, 1-14.
McPherson, M. W. A method of objectively measuring shop performance.
Journal of Applied Psychology, 1945, 29, 22-26.
Office of Federal Contract Compliance. Regulations on Employee Testing
and other Selection Procedures. Washington: U.S. Department
of Labor, 1971.
O'Leary, L. R. Fair employment, sound psychometric practice and
reality. American Psychologist, 1973, 28, 147-150.
Robins, A. R., Rog, H. L., and de Jung, J. E. Assessment of NCO
leadership (Test Criterion Development), U.S. Army Personnel Research
Branch, The Adjutant General's Office, Technical Research
Report 1111, July, 1958.
Ronan, W. W. and Prien, E. P. Toward a criterion theory: A review and
analysis of research and opinion. Greensboro, N.C.: Richardson
Foundation, Inc., 1966.
Scheuer, W. Performance testing in New Jersey. Good Government, 1970,
87, 5-15.
Schmidt, F. L. A pilot study for the evaluation of procedures for the
construction of performance measures in the skilled trades and
technical occupations. Proposal submitted to Development
Systems Corporation, Chicago, Ill., Michigan State University,
1972.
Schmidt, F. L., Greenthal, A. L., Berner, J. G., Hunter, J. E., and
Williams, F. M. A performance measurement feasibility study:
Implications for manpower policy. Final report to Manpower
Administration, U.S. Department of Labor, Contract No.
82-17-71-48, Sept. 30, 1974 (Subcontract of Development Sys-
tems Corporation, Chicago, Ill.).
Shimberg, B., Esser, B. ., and Kruger, D. H. Occupational licensing:
Practices and policies. Washington, D.C.: Public Affairs
Press, 1972.
Siegel, A. I. The check list as a criterion of proficiency. Journal
of Applied Psychology, 1954a, 38, 93-95.
Siegel, A. I. Retest-reliability by a movie technique of test
administrators' judgment of performance in progress. Journal of
Applied Psychology, 1954b, 38, 390-392.
Siegel, A. I. Interobserver consistency for measurements of the
intangible products of performance. Journal of Applied
Psychology, 1955, 39, 280-282.
Siegel, A. I. and Jensen, J. The development of a job sample
troubleshooting performance examination. Journal of Applied
Psychology, 1955, 39, 343-347.
Spergel, P., and Lechner, S. S. Vocational assessment through work
sampling. Journal of Jewish Communal Services, 1968, 88,
225-229.
Steel, M., Balinsky, B., and Lang, H. A study on the use of a work
sample. Journal of Applied Psychology, 1945, 29, 14-21.
Stuit, D. B. Personnel research and test development in the bureau of
naval research. Princeton, N.J.: Princeton University Press,
1947.
Tiffin, J. and McCormick, E. J. Industrial psychology. New Jersey:
Prentice Hall, Inc., 1965.
Tiffin, J. and Rogers, H. B. The selection and training of inspectors.
Personnel, 1941, 18, 14-31.
Wernimont, P. F. and Campbell, J. P. Signs, samples, and criteria.
Journal of Applied Psychology, 1968, 52, 372-376.