MEASUREMENT AND EVALUATION IN THE EDUCATIONAL PROCESS ~ Nyalan Ilmu

PROBLEM

The selection and treatment of topics were guided by two general considerations: (1) knowledge and skills that are necessary for the development of valid evaluation measures, and (2) knowledge and competences that are required for a proper interpretation of informal and standardized tests.
We have incorporated an innovation designed to reduce the inordinate amount of time frequently devoted to statistics in introductory measurement courses. In addition to the usual narrative treatment of the statistical concepts, we have interspersed self-instructional material so that the reader is exposed to the same concepts via two different, but complementary, approaches. We have also eliminated obsolete material devoted to computing statistics from data grouped into intervals, since such grouping is an unnecessary procedural step which results in formulas that obscure the conceptual meaning of the concepts.
The most fundamental change from earlier editions is in the greater emphasis given to standardized testing. Although most pupils are tested repeatedly with standardized tests, few teachers, for whom this information is primarily intended, have the background required to interpret this information properly.

PROBLEM SOLVING
MEASUREMENT AND EVALUATION IN THE EDUCATIONAL PROCESS
Are you aware of how extensively schools depend on a related procedure called evaluation? Throughout this book, we use the word evaluation to designate summing-up processes in which value judgments play a large part, as in grading and promoting students. We consider the construction, administration, and scoring of tests as the measurement process. Interpreting such scores (saying whether they are good or bad for a specific purpose) is evaluation.
We consider the conversion of test scores to grades such as A, B, C, D, E; “Excellent,” “Good,” “Fair,” “Poor”; or “High,” “Average,” “Low” as evaluation rather than measurement, because value judgments are made. Important value judgments are made in selecting the items and the time and method of giving the test and scoring it, but the process of attaching value judgments to performance on the measure is uniquely evaluation. Whether a student’s score is good or bad for a given purpose cannot be determined solely from the score itself. An interpretation must be made. The score is often interpreted in terms of fixed standards, such as 80-89 percent equals B, or in terms of the student’s rank on the test in his class or his rank in relationship to his estimated potential for learning. Interpreting one student’s test score is evaluation at an elementary level. Evaluating a curriculum or special program is complex. Procedures for complex program evaluations are treated in the American Educational Research Association Monograph series on curriculum evaluation.
If we define “tests and measurements” narrowly as the preparing, administering, scoring, and norming of objective tests, we are likely to overlook important ways in which evaluation supports the entire educational system. Schools are organized as they are because, on the basis of much experience, current patterns appear to “work best,” at least in the eyes of those who make educational decisions. Why is a particular school built where it is, as large as it is, and with certain facilities? Why are some teachers hired to staff it, others not? What determines salaries, the choice of textbooks and other instructional aids, grades, promotions, reports to parents, grouping patterns, the community’s reactions to the school and its products, recommendations for college and for jobs? All of these decisions involve evaluations.
Measurement and evaluation encompass such subjective aspects as the judgments made by teachers and administrators. Faced by complex problems of measurement and evaluation of pupil growth and influences affecting it, we cannot reject any promising resource. Various sorts of information supplement each other.
Fortunately, certain concepts, principles, and skills are useful at all levels and in nearly all positions. Even parents and others not professionally concerned with individual appraisal would benefit from a clear understanding of such concepts as “validity,” “reliability,” “IQ,” and “norms.” By concentrating on fundamental concepts and skills, we present in one basic textbook the essentials for most teachers. Ebel has outlined six requisites for a teacher to be competent in educational measurement:
1. Know the educational uses, as well as the limitations, of educational tests.
2. Know the criteria by which the quality of a test should be judged and how to secure evidence relating to these criteria.
3. Know how to plan a test and write the test questions to be included in it.
4. Know how to select a standardized test that will be effective in a particular situation.
5. Know how to administer a test properly, efficiently, and fairly.
6. Know how to interpret test scores correctly and fully, but with recognition of their limitations.
As Ebel (1961c) notes, teachers must know how to perform certain aspects of measurement and evaluation themselves, such as constructing tests, giving grades, assessing potentialities, and interpreting standardized intelligence and achievement tests. They should know how to select from the many available tests, inventories, questionnaires, rating scales, check lists, and the like those most suitable for a particular purpose. Besides being able to understand directions for administering, scoring, and interpreting tests, teachers should possess the higher ability to compare the most promising ones before the choice itself is made. This requires attaining various concepts necessary to understand test publishers’ literature, reviews, and articles reporting test research.
Tests serve a variety of functions. The purpose for which a test is given determines not only the appropriate type of test but also the test’s characteristics (such as difficulty and reliability). A measure designed for accurate assessment of individual differences in arithmetic fundamentals requires very high test reliability, whereas a much shorter (and hence, less reliable) test might suffice for a class or program evaluation. Findley (1963b) classified the purposes served by tests in education under three interrelated categories: (1) instructional, (2) administrative, and (3) guidance.

INSTRUCTIONAL FUNCTIONS
Participation of the teaching staff in selecting and constructing evaluation instruments has improved the instruments and has also clarified the objectives of instruction, making them real and meaningful to teachers.
Tests Provide a Means of Feedback to the Teacher. Feedback from tests helps the teacher provide more appropriate instructional guidance for individual students as well as for the class as a whole. Well-designed tests may also be of value for pupil self-diagnosis, since they help students identify areas of specific weakness.
Properly Constructed Tests Can Motivate Learning. As a general rule, students pursue mastery of objectives more diligently if they expect to be evaluated. In the intense competition for a student’s time, courses without examinations are often “squeezed” out of high-priority positions.
Examinations Are a Useful Means of Overlearning. When we review, interact with, or practice skills and concepts even after they have been mastered, we are engaging in what psychologists call overlearning. Even if a student correctly answers every question on a test, he is engaging in behavior that is instructionally valuable, apart from the evaluation being served by the test.
ADMINISTRATIVE FUNCTIONS
Tests Provide a Mechanism for “Quality Control” for a School or School System. National or local norms can provide a basis for assessing certain curricular strengths and weaknesses. If a school district does not have a means for periodic self-evaluation, instructional inadequacies may go unnoticed.
Tests Are Useful for Program Evaluation and Research. Outcome measures are necessary to determine whether an innovative program is better or poorer than the conventional one in facilitating the attainment of specific curricular objectives. Standardized achievement tests have been the key sources of data for evaluating the success of federally funded programs.
GUIDANCE FUNCTIONS
Tests Can Be of Value in Diagnosing an Individual’s Special Aptitudes and Abilities. Obtaining measures of scholastic aptitude, achievement, interest, and personality is often an aspect of the counseling process. The use of information from standardized tests and inventories can be helpful for guiding the selection of a college, choosing an appropriate course of study, discovering unrecognized abilities, and so on. Tests play an important role in today’s schools and other aspects of life. Thus teachers especially, as well as others, must know how to use and interpret tests correctly.

STATISTICAL CONCEPTS IN TEST INTERPRETATION
Nearly all of today’s test manuals refer to central tendency, percentiles, standard scores, reliability, and validity. The view emphasized some years ago by a renowned teacher of educational statistics (Walker, 1950, p. 31) is even broader: “The conclusion seems inescapable that some aspects of statistical thinking which were once assumed to belong in rather specialized technical courses must now be considered part of general cultural education.” Even to be an intelligent reader of the popular press frequently requires knowledge of certain statistical concepts.

THE UNGROUPED FREQUENCY DISTRIBUTION
If we start with the lowest score and list every possible score with its frequency until we have included the highest score, we will construct an ungrouped frequency distribution. Two of the most important concepts that apply to various kinds of test data are central tendency and variability.
A tendency for the scores to concentrate somewhere near the “center” is characteristic of most frequency distributions. A measure of central tendency is the value that typifies and best represents the whole distribution.
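The construction of an ungrouped frequency distribution described above can be sketched in a few lines of Python. This is only an illustration: the function name and the quiz scores are hypothetical, and the sketch assumes whole-number scores.

```python
from collections import Counter

def ungrouped_frequency_distribution(scores):
    """List every possible score from lowest to highest with its frequency."""
    counts = Counter(scores)
    # Scores that nobody earned still appear, with a frequency of 0.
    return [(s, counts[s]) for s in range(min(scores), max(scores) + 1)]

quiz_scores = [7, 9, 9, 10, 12, 12, 12, 13]  # hypothetical data
distribution = ungrouped_frequency_distribution(quiz_scores)
for score, frequency in distribution:
    print(score, frequency)
```

Listing the unearned scores (8 and 11 here) keeps the gaps in the distribution visible, which a simple tally of observed scores would hide.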

THE MEDIAN
A widely used average in educational measurement is the median. The median is the score point that divides the distribution into halves.
In an ungrouped distribution, the middle score may be called the median.
The median is often used as a reference point for describing the location of individual pupils in a distribution.
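As a small illustration of the median as the point that divides a distribution into halves, the sketch below uses Python's standard `statistics` module on two hypothetical score lists. With an even number of scores, `statistics.median` takes the midpoint of the two middle scores, which is one common convention and is assumed here.

```python
import statistics

odd_scores = [3, 5, 8, 9, 11]        # hypothetical: five scores
even_scores = [3, 5, 8, 9, 11, 14]   # hypothetical: six scores

median_odd = statistics.median(odd_scores)    # the middle score: 8
median_even = statistics.median(even_scores)  # midpoint of 8 and 9: 8.5

print(median_odd, median_even)
```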

THE MEAN
The most familiar average is the arithmetic mean. This measure is in such common use that many people regard it as the average because it is the only average they know anything about. When the term “average” is used in ordinary conversation or in newspapers in such statements as “average temperature”, “average rainfall”, “average yield of corn and wheat”, and “average price”, it is likely that the arithmetic mean is meant.
The mean can be computed by simply summing the measures and dividing by their number. The measure so obtained is then the value that each individual would have if all shared equally. Unlike the median, the mean is affected by the magnitude of every score in the distribution. Increase any score by 10 points and you increase the mean by 10/N points, where N is the number of scores in the distribution.
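The effect of a single score on the mean can be checked directly. The sketch below uses hypothetical scores and raises one of them by 10 points; the mean moves by 10/N while the median, in this example, does not move at all.

```python
import math
import statistics

scores = [60, 70, 80, 90, 100]   # hypothetical data, N = 5
n = len(scores)

mean_before = sum(scores) / n    # sum of the measures divided by their number
median_before = statistics.median(scores)

raised = scores.copy()
raised[0] += 10                  # increase one score by 10 points

mean_after = sum(raised) / n
median_after = statistics.median(raised)

print(mean_before, mean_after)       # the mean rises by 10/N
print(median_before, median_after)   # the median is unchanged here
```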

THE MODE
The most frequent score is called the mode. It is determined by inspection. The crude mode is not a very reliable average, especially with small groups.
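A brief sketch of why the crude mode is unstable in small groups, using hypothetical scores and `statistics.multimode` (available in Python 3.8 and later), which returns every score tied for the highest frequency:

```python
import statistics

scores = [2, 3, 3, 5, 5, 7]                  # hypothetical small group
modes = statistics.multimode(scores)          # two scores tie for most frequent

# Adding just two more pupils who score 7 moves the mode to the top of the range.
modes_after = statistics.multimode(scores + [7, 7])

print(modes, modes_after)
```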
COMPARISON OF THE MEAN, MEDIAN, AND MODE
The sum of the deviations (differences) of all scores from the arithmetic mean is always precisely zero (provided that one does not lose sight of the plus and minus signs). When we consider only the size of the difference between each score and the “average,” we will observe that the sum of these differences is less from the median than from any other score or point, a fact that gives some logical appeal to the median as a measure of “average,” or central tendency.
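Both properties can be verified numerically. The sketch below uses hypothetical scores; the helper name `sum_of_absolute_deviations` is introduced only for illustration.

```python
import statistics

scores = [2, 4, 4, 5, 10]             # hypothetical data
mean = sum(scores) / len(scores)      # 5.0
median = statistics.median(scores)    # 4

# Signed deviations from the mean cancel out exactly.
signed_sum = sum(x - mean for x in scores)

def sum_of_absolute_deviations(point):
    """Total distance of all scores from a reference point, ignoring signs."""
    return sum(abs(x - point) for x in scores)

from_median = sum_of_absolute_deviations(median)
from_mean = sum_of_absolute_deviations(mean)

print(signed_sum)               # 0.0
print(from_median, from_mean)   # the total is smaller when measured from the median
```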

The Range
The range is simply the distance between the highest score and the lowest score.

The Quartile Deviation and Percentiles
A proper understanding of the quartile deviation as a measure of variability depends on a knowledge of percentiles. You will recall that the median is the 50th percentile; that is, 50 percent of all the frequencies lie below that point and 50 percent lie above it. A percentile is a score point in the score distribution below which the stated percentage of all measures lies. Thus an individual who scores at the 30th percentile of his class has done better than 30 percent of the students and poorer than 70 percent. Percentiles are computed in much the same manner as the median; the only difference is that the number of frequencies to be counted up (or down) varies with the percentile desired.
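The definition of a percentile as a point below which a stated percentage of measures lies can be sketched by counting. The function name and data below are hypothetical, and the sketch simply counts scores strictly below the given point; it does not reproduce any particular textbook procedure for handling tied scores or interpolating within a score interval.

```python
def percent_below(scores, point):
    """Percentage of all scores that lie strictly below the given score point."""
    below = sum(1 for s in scores if s < point)
    return 100 * below / len(scores)

class_scores = list(range(1, 21))            # hypothetical: 20 pupils scoring 1 to 20
standing = percent_below(class_scores, 7)    # six of the twenty scores lie below 7

print(standing)  # 30.0, so a score of 7 sits at the 30th percentile of this class
```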

Stanines
Another common standard-score system is the “stanine” scale (standard scores with nine categories), which was developed and used extensively by the Army Air Force during World War II. Stanines are normalized standard scores with a mean of 5 and a standard deviation of 2.
To convert raw scores to stanine scores, arrange the test papers in order from highest to lowest score. Pick the top 4 percent and assign them a stanine score of 9. The next 7 percent receive a stanine score of 8; the next 12 percent fall in stanine 7; the next 17 percent fall in stanine 6; the next 20 percent are in the 5th stanine, and so on. Sound simple? With a little practice, it is. Any conscientious clerk can secure stanine scores this way, and at the same time the scores are being normalized.
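The clerical procedure above can be sketched as follows. The percentages for stanines 4 through 1 (17, 12, 7, and 4) are filled in here by symmetry with the upper half of the scale. The function name is hypothetical, and the sketch cuts strictly by rank, so tied scores that straddle a boundary would be split between stanines; in practice tied papers would be kept together.

```python
# (stanine, percent of the group), from the top of the distribution down
STANINE_PERCENTS = [(9, 4), (8, 7), (7, 12), (6, 17), (5, 20),
                    (4, 17), (3, 12), (2, 7), (1, 4)]

def assign_stanines(scores):
    """Return (score, stanine) pairs ordered from highest to lowest score."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    pairs = []
    cumulative_percent = 0
    start = 0
    for stanine, percent in STANINE_PERCENTS:
        cumulative_percent += percent
        end = round(n * cumulative_percent / 100)  # rank at which this stanine ends
        pairs.extend((score, stanine) for score in ranked[start:end])
        start = end
    return pairs

demo = assign_stanines(range(1, 101))   # hypothetical: 100 papers scored 1 to 100
print(demo[:4])                          # the top 4 percent receive stanine 9
```

With 100 papers the resulting stanines average exactly 5 and have a standard deviation just under 2, in line with the description of the scale.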

The Concept of Correlation
During the latter part of the nineteenth century, Sir Francis Galton and the pioneer English statistician Karl Pearson succeeded in developing the theory and mathematical basis for what is now known as correlation. They were concerned with the relationship between two variables, for example, height and weight. Height and weight vary together (that is, correlate positively), though certainly not perfectly; there are “beanpoles” and “five-by-fives,” which explains why the relationship is not higher than it is. It would be possible to select a group so that the taller a person in the group is the less he weighs, but this negative correlation between height and weight would not be expected for individuals picked at random from the general population.

Interpreting the Coefficient of Correlation
In interpreting a coefficient of correlation, several factors must be considered. The first is the sign of the coefficient; the second is the coefficient’s magnitude or size.
The second aspect, equally important but far more difficult to interpret, is the coefficient’s size, which indicates the degree or closeness of the relationship, just as the sign indicates the direction of the relationship.
Another important factor in interpreting correlation coefficients pertains to the size of the sample on which the correlation coefficient was determined.
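To make the sign and size of a coefficient concrete, the sketch below computes the standard Pearson product-moment coefficient r from deviations about the means. The function name and the height and weight figures are hypothetical; with only six cases, the resulting r would be a very unstable estimate, which is the sample-size caution noted above.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

heights = [60, 62, 65, 68, 70, 72]          # hypothetical, in inches
weights = [115, 120, 150, 140, 170, 165]    # hypothetical, in pounds

r = pearson_r(heights, weights)
print(round(r, 2))  # positive sign (vary together) but below 1 (not perfectly)
```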
Validity and Reliability Coefficients
One of the most important uses of the coefficient of correlation is in determining the validity of a test. As we shall see in Chapter 4, predictive validity is determined by setting up a criterion to be predicted and then computing the coefficient of correlation between the predictor scores and the scores on the criterion, for example, rank in high-school graduating class correlated with college freshman grade point average. The r so obtained is called a predictive validity coefficient and is interpreted in the same way as other coefficients of correlation.
A second use of the coefficient of correlation is in determining the comparable-forms or the test-retest “reliability” of a test. Reliability is the degree of consistency with which a test measures.

Other Coefficients of Correlation
The r discussed above is by far the most common type of correlation coefficient, but there are several other types. A distribution of test scores can be expressed in various ways, such as the scores themselves, as ranks, or as a dichotomy.