In his blog post Which test has the highest g loading?, Danish psychologist Emil Kirkegaard states that “(…) vocabulary is the single best way to measure intelligence in terms of g-loading. The chief other benefits of vocabulary tests is that they are fast to administer, and not stressful for the subjects.” But vocabulary tests also have cons, namely: “The chief disadvantage is that they are prone to bias, both with regards to age and non-native speakers.” He then goes on to quote extensively from the famous psychologist Arthur R. Jensen, of whom he wrote in another blog post: “Pretty much everything is this field has already been said by Arthur Jensen somewhere in his 400+ papers and 5+ books.”

Jensen gave one of those ‘everythings’, namely the answer to the question of why vocabulary tests are highly g-loaded and correlate highly with intelligence, in one of those 5+ books: his 1980 Bias in Mental Testing. The entire Jensen quotation is given below (underlining and links were added by me):

Word knowledge figures prominently in standard tests. The scores on the vocabulary subtest are usually the most highly correlated with total IQ of any of the other subtests. This fact would seem to contradict Spearman’s important generalization that intelligence is revealed most strongly by tasks calling for the eduction of relations and correlates. Does not the vocabulary test merely show what the subject has learned prior to taking the test? How does this involve reasoning or eduction?

In fact, vocabulary tests are among the best measures of intelligence, because the acquisition of word meanings is highly dependent on the eduction of meaning from the contexts in which the words are encountered. Vocabulary for the most part is not acquired by rote memorization or through formal instruction. The meaning of a word most usually is acquired by encountering the word in some context that permits at least some partial inference as to its meaning. By hearing or reading the word in a number of different contexts, one acquires, through the mental processes of generalization and discrimination and eduction, the essence of the word’s meaning, and one is then able to recall the word precisely when it is appropriate in a new context. Thus the acquisition of vocabulary is not as much a matter of learning and memory as it is of generalization, discrimination, eduction, and inference. Children of high intelligence acquire vocabulary at a faster rate than children of low intelligence, and as adults they have a much larger than average vocabulary, not primarily because they have spent more time in study or have been more exposed to words, but because they are capable of educing more meaning from single encounters with words and are capable of discriminating subtle differences in meaning between similar words. Words also fill conceptual needs, and for a new word to be easily learned the need must precede one’s encounter with the word. It is remarkable how quickly one forgets the definition of a word he does not need. I do not mean “need” in a practical sense, as something one must use, say, in one’s occupation; I mean a conceptual need, as when one discovers a word for something he has experienced but at the time did not know there was a word for it. Then when the appropriate word is encountered, it “sticks” and becomes a part of one’s vocabulary. Without the cognitive “need,” the word may be just as likely to be encountered, but the word and its context do not elicit the mental processes that will make it “stick.”

During childhood and throughout life nearly everyone is bombarded by more different words than ever become a part of the person’s vocabulary. Yet some persons acquire much larger vocabularies than others. This is true even among siblings in the same family, who share very similar experiences and are exposed to the same parental vocabulary.

Vocabulary tests are made up of words that range widely in difficulty (percentage passing); this is achieved by selecting words that differ in frequency of usage in the language, from relatively common to relatively rare words. (The frequency of occurrence of each of 30,000 different words per 1 million words of printed material—books, magazines, and newspapers—has been tabulated by Thorndike and Lorge, 1944.) Technical, scientific, and specialized words associated with particular occupations or localities are avoided. Also, words with an extremely wide scatter of “passes” are usually eliminated, because high scatter is one indication of unequal exposure to a word among persons in the population because of marked cultural, educational, occupational, or regional differences in the probability of encountering a particular word. Scatter shows up in item analysis as a lower than average correlation between a given word and the total score on the vocabulary test as a whole. To understand the meaning of scatter, imagine that we had a perfect count of the total number of words in the vocabulary of every person in the population. We could also determine what percentage of all persons know the meaning of each word known by anyone in the population. The best vocabulary test limited to, say, one hundred items would be that selection of words the knowledge of which would best predict the total vocabulary of each person. A word with wide scatter would be one that is almost as likely to be known by persons with a small total vocabulary as by persons with a large total vocabulary, even though the word may be known by less than 50 percent of the total population. Such a wide-scatter word, with about equal probability of being known by persons of every vocabulary size, would be a poor predictor of total vocabulary. It is such words that test constructors, by statistical analyses, try to detect and eliminate.

It is instructive to study the errors made on the words that are failed in a vocabulary test. When there are multiple-choice alternatives for the definition of each word, from which the subject must discriminate the correct answer among the several distractors, we see that failed items do not show a random choice among the distractors. The systematic and reliable differences in choice of distractors indicate that most subjects have been exposed to the word in some context, but have inferred the wrong meaning. Also, the fact that changing the distractors in a vocabulary item can markedly change the percentage passing further indicates that the vocabulary test does not discriminate simply between those persons who have and those who have not been exposed to the words in context. For example, the vocabulary test item ‘erudite’ has a higher percentage of errors if the word ‘polite’ is included among the distractors; the same is true for ‘mercenary’ when the words ‘stingy’ and ‘charity’ are among the distractors; and ‘stoical’ – ‘sad’, ‘droll’ – ‘eerie’, ‘fecund’ – ‘odor’, ‘fatuous’ – ‘large’.

Another interesting point about vocabulary tests is that persons recognize many more of the words than they actually know the meaning of. In individual testing they often express dismay at not being able to say what a word means, when they know they have previously heard it or read it any number of times. The crucial variable in vocabulary size is not exposure per se, but conceptual need and inference of meaning from context, which are forms of eduction. Hence, vocabulary is a good index of intelligence.

Picture vocabulary tests are often used with children and nonreaders. The most popular is the Peabody Picture Vocabulary Test. It consists of 150 large cards, each containing four pictures. With the presentation of each card, the tester says one word (a common noun, adjective, or verb) that is best represented by one of the four pictures, and the subject merely has to point to the appropriate picture. Several other standard picture vocabulary tests are highly similar. All are said to measure recognition vocabulary, as contrasted to expressive vocabulary, which requires the subject to state definitions in his or her own words. The distinction between recognition and expressive vocabulary is more formal than psychological, as the correlation between the two is close to perfect when corrected for errors of measurement.

Arthur R. Jensen – Bias in Mental Testing (1980)
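
To make Jensen's remark about scatter a little more concrete: he says that a wide-scatter word shows up in item analysis as a lower than average correlation between that word and the total score. The sketch below is my own illustration, not Jensen's procedure. It uses a tiny made-up response matrix (six hypothetical persons, four hypothetical words, 1 = passed, 0 = failed) and correlates each word with the total score on the remaining words, a common variant that keeps a word from being correlated with itself. The function names and the data are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length lists (population form)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when a variable never varies
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_rest_correlations(responses):
    """Correlate each item with the total score on all OTHER items.

    responses: one list of 0/1 item scores per person.
    """
    totals = [sum(person) for person in responses]
    correlations = []
    for j in range(len(responses[0])):
        item = [person[j] for person in responses]
        rest = [t - s for t, s in zip(totals, item)]  # total without item j
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical data: rows are persons ordered from small to large vocabulary.
# Words 1-3 get harder step by step; word 4 is passed or failed regardless
# of how many other words a person knows (a "wide scatter" word).
responses = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]

for number, r in enumerate(item_rest_correlations(responses), start=1):
    print(f"word {number}: item-rest correlation = {r:.2f}")
```

In this toy matrix the fourth word has the lowest correlation with the rest of the test, which is the kind of word a test constructor would flag. With real data a low correlation can have other causes too (an ambiguous item, a bad distractor), so it is a signal to inspect the word, not proof of unequal exposure.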