Aggregation of small groups


4 Data revisions

As hinted at earlier, estimating CAMSIS models on the 'raw' data can be unsatisfactory for two reasons. From a sampling point of view, some occupational base units may be represented by only a handful of cases; from a technical point of view, a very large number of different base units presents estimation problems for CA models in SPSS (though not for RC models in LEM). We therefore generally regard it as desirable to collapse together the most sparsely represented occupational base units, if and when this can be justified on empirical and substantive grounds. (Note that another line of thinking suggests that working with such sparse raw data poses neither a sampling nor a technical problem. Rytina (2000), for instance, has used data from a British survey, developing RC model estimations which are capable of handling large numbers of categories, and arguing that there is no inherent mis-specification in inferring the properties of an occupation from the few cases which represent it, if that occupational unit is genuinely sparsely represented in the total population.)
The strategy in CAMSIS construction has been to use the results of the preliminary CA models described above, along with limited theoretical input where relevant, to identify occupational units which may reasonably be combined. We conduct such revisions separately for each version and base unit used, and within each gender group; for example, we revise the data separately for the title-only and title-by-status versions, and separately for the male and female populations. Note that the effort required for this job falls as the size of the dataset supplied for scale construction grows, since larger samples leave fewer sparsely represented units; in practice this is a strong motivation for maximising the relevant sample sizes!
We take an arbitrary threshold of 20 as the minimum number of cases by which an occupational unit, or combination of occupational units, can be regarded as adequately represented from a sampling perspective. In practice, because we are also aware that a number of specific cases are likely to be identified later and excluded as 'pseudo-diagonals', we find it more practical to begin with a threshold of 30 as the minimum, since this considerably increases the chances of occupational units still being represented by at least 20 cases even after certain 'pseudo-diagonals' have been excluded from the model. On the other hand, we do not adhere strictly to the minimum thresholds at all times. Typically, a few sparsely represented occupational units have no other groups that are readily identifiable as near neighbours on either empirical or substantive grounds. In such cases we are inclined to retain them as separate, sparse groups, adding specific notes on the units involved to the relevant versions' notes distributed in the final files.
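The threshold check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the constant names and the unit codes are hypothetical, and the data are invented. A unit is flagged as sparse when it has fewer cases than the working minimum of 30.

```python
from collections import Counter

WORKING_THRESHOLD = 30  # starting minimum, leaving headroom for later exclusions
FINAL_THRESHOLD = 20    # minimum regarded as adequate from a sampling perspective


def flag_sparse_units(unit_codes, threshold=WORKING_THRESHOLD):
    """Return {unit_code: n_cases} for units with fewer than `threshold` cases."""
    counts = Counter(unit_codes)
    return {unit: n for unit, n in counts.items() if n < threshold}


# Hypothetical data: one occupational unit code per case.
cases = [101] * 45 + [102] * 25 + [103] * 7
print(flag_sparse_units(cases))                   # {102: 25, 103: 7}
print(flag_sparse_units(cases, FINAL_THRESHOLD))  # {103: 7}
```

Units flagged in this way are candidates for merging only; as noted above, a flagged unit with no plausible near neighbour may still be retained as a separate group.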
The 'empirical and substantive' grounds on which we merge occupations are of three kinds. First, we refer to any higher-level groupings of the occupational units, such as major or minor groups, and try to merge occupational units only with others from the same group.
Second, our main source of information on the similarity or otherwise of occupational units is the set of results from the multiple preliminary CA models described in section 3 earlier. From the array of model results, we look for circumstances in which the scores assigned to occupational title units, their various group units, and the various status and group-by-status units are close or distant. Then, for every sparsely represented base unit, we search for 'near neighbours', that is, base units which have similar scores in the preliminary models, and merge those cases together.
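The search for near neighbours is a matter of judgement in practice, but its core can be sketched as follows. This is a minimal illustration, assuming each unit carries a group code, a preliminary score and a case count; the data layout, the function name and all the values shown are hypothetical.

```python
def nearest_neighbour(unit, units, same_group_only=True):
    """Return the code of the unit whose preliminary score is closest to `unit`.

    `units` maps a unit code to a dict with keys 'group', 'score' and 'n'.
    Returns None when no candidate neighbour exists.
    """
    target = units[unit]
    candidates = [
        code for code, info in units.items()
        if code != unit
        and (not same_group_only or info["group"] == target["group"])
    ]
    if not candidates:
        return None  # no readily identifiable neighbour; keep the unit separate
    return min(candidates,
               key=lambda code: abs(units[code]["score"] - target["score"]))


# Hypothetical preliminary results for four units in two minor groups.
units = {
    151: {"group": 15, "score": 0.42, "n": 120},
    152: {"group": 15, "score": -0.10, "n": 12},   # sparsely represented
    153: {"group": 15, "score": -0.18, "n": 64},
    161: {"group": 16, "score": -0.11, "n": 80},
}
print(nearest_neighbour(152, units))                         # 153
print(nearest_neighbour(152, units, same_group_only=False))  # 161
```

Restricting candidates to the same group reflects the first criterion above; relaxing it shows how a closer score in another group would otherwise be chosen.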
A typical example may be that, when revising a base title-by-status unit, we note that within minor group 15, say, the scores for the minor-group-by-status model suggest that status differentials in scores are large; we therefore merge sparsely represented occupations within their status group, across neighbouring occupational titles. On the other hand, within minor group 16, say, we observe from the minor-group-by-status model that status differences in scores are less pronounced, whereas occupational differences can be large according to the title-only model; in this case we merge sparsely represented units between status groups, within title units. It is important to remember, however, when using these preliminary model results, that sparsity itself can contribute to the estimation of an extreme score for a given unit, so the estimated scores should always be cross-checked against the number of cases that contributed to them.
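The contrast between minor groups 15 and 16 can be expressed as a crude rule of thumb. The sketch below is not a procedure stated in this guide: it simply compares the range of status scores with the range of title scores within a minor group. The function names and all score values are hypothetical.

```python
def score_spread(scores):
    """Range (maximum minus minimum) of a list of preliminary scores."""
    return max(scores) - min(scores)


def merge_direction(status_scores, title_scores):
    """Suggest a merge direction for sparse units in one minor group.

    If status differences dominate, merge within status groups across titles;
    otherwise merge across status groups within titles.
    """
    if score_spread(status_scores) > score_spread(title_scores):
        return "within status, across titles"
    return "across statuses, within title"


# Hypothetical scores: status differences dominate in group 15,
# title differences dominate in group 16.
print(merge_direction([-0.60, 0.10, 0.75], [0.05, 0.12, 0.20]))
print(merge_direction([0.02, 0.08], [-0.50, 0.40]))
```

Any such comparison should be read alongside the case counts, since a wide spread may itself be produced by sparsely represented units.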
Our third and final input is, in contrast, entirely substantive, and may or may not be applied depending on the researcher's preferences. In some cases we may have strong substantive expectations, even in the absence of, or in contradiction to, preliminary empirical evidence, that certain occupational units are very different from their apparent near neighbours in the occupational unit schema, and / or are very similar to other units which are not near neighbours, in terms of, for instance, the occupations' technical content and expected stratification position. In such cases it may be desirable to overrule the most obvious merges suggested by the preliminary models in favour of a substantively imposed merging of the relevant sparse units. In practice we have used this criterion in only a handful of cases in current CAMSIS constructions, usually documenting the relevant mergers in the version-specific notes.
When conducting these merges, we have usually worked with a printout of occupational units (such as the printout of an SPSS 'tables' command for title cross-classified by status units), which can be easily scanned for sparsely represented units. The recommended merger is simply written down by hand on the printout. Subsequently, the merges are run in SPSS in the style of this sample SPSS syntax file. In this example the recode is done on the male and female title-by-status base units; a separate but comparable segment would need to be run if the recode were conducted on title-only base units. Note that in many current CAMSIS examples the manual recodes run to several hundred entries; we suggest that data entry for the code transformations can be made easier by first entering the numbers only in a plain text file, then using replace functions (as available, for instance, in pfe) to insert the ()= symbols necessary for correct SPSS syntax.
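As an alternative to editor replace functions, the conversion from a plain list of numbers to recode syntax can be scripted. The Python sketch below assumes one merge per line, written as the old code followed by the new code and separated by whitespace. The variable name `mtitst`, the function name and the codes are hypothetical, and the layout of the output is not taken from the sample syntax file.

```python
def pairs_to_recode(text, variable):
    """Turn lines of 'old new' integer codes into an SPSS RECODE command."""
    clauses = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            raise ValueError(
                f"line {line_no}: expected two integer codes, got {line!r}")
        old, new = parts
        clauses.append(f"({old}={new})")
    if not clauses:
        raise ValueError("no recode pairs found")
    # One clause per line, with the command terminated by a full stop.
    return f"RECODE {variable}\n  " + "\n  ".join(clauses) + ".\n"


# Hypothetical merges: old unit code, then the code it is merged into.
pairs = """1521 1531
1522 1531

1610 1611
"""
print(pairs_to_recode(pairs, "mtitst"))
```

The validation step is a design choice: with several hundred hand-entered pairs, a mistyped line is easier to catch here than after the recode has been run.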
A useful hint in conducting these merges, bearing in mind that a number of cases are always likely to be excluded as 'pseudo-diagonals', is to review the merges on only the sub-population of cases which are not members of the most prominent pseudo-diagonals identified by the preliminary models. (In addition, it is often a useful shortcut to treat all exact occupational title diagonals as pseudo-diagonals, since past experience shows that they very often eventually show up as pseudo-diagonals even if not initially included, whilst their cases contribute little to the model construction because of their diagonality itself.) The chosen pseudo-diagonal cases can be excluded when generating the table of results from which the occupations are merged, but this does assume that those cases really will be regarded as pseudo-diagonals in the final model. This shortcut also requires the subsequent adjustment to the final file shown in the sample file, where it proves convenient to recode all pseudo-diagonals within a given range to a smaller number of named husband-wife combinations.
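The effect of this shortcut on the case counts can be sketched as below. The example assumes the data are held as (husband unit, wife unit) pairs; the function name, the codes and the list of pseudo-diagonal combinations are hypothetical. Exact title diagonals are dropped along with any combinations named as pseudo-diagonals, and the remaining cases are counted separately for the male and female units.

```python
from collections import Counter


def counts_excluding_pseudo_diagonals(couples, pseudo_diagonals=frozenset()):
    """Count cases per husband unit and per wife unit after exclusions.

    `couples` is an iterable of (husband_unit, wife_unit) pairs. Couples on
    the exact diagonal, or listed in `pseudo_diagonals`, are skipped.
    """
    male, female = Counter(), Counter()
    for husband, wife in couples:
        if husband == wife or (husband, wife) in pseudo_diagonals:
            continue
        male[husband] += 1
        female[wife] += 1
    return male, female


# Hypothetical couples; (30, 40) stands for a prominent pseudo-diagonal.
couples = [(10, 10)] * 5 + [(10, 20)] * 3 + [(30, 20)] * 2 + [(30, 40)] * 4
male, female = counts_excluding_pseudo_diagonals(couples, {(30, 40)})
print(dict(male))    # {10: 3, 30: 2}
print(dict(female))  # {20: 5}
```

Counts produced in this way are only as reliable as the assumption, noted above, that the excluded combinations will still be treated as pseudo-diagonals in the final model.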


Last modified 14 February 2002
This document is maintained by Paul Lambert (paul.lambert@stirling.ac.uk)