

CAMSIS: Detailed Construction Procedures

Although basic analyses for producing a CAMSIS measure can be carried out fairly straightforwardly with standard statistical packages, the associated procedures for data preparation and manipulation are considerably more complicated. We have developed what is, in effect, a 'template' for carrying out this work, the steps in which are set out below and in the linked pages. This first page presents an outline overview; each stage has a link to a further page that provides additional detail; each of those pages, in turn, has a link to a page with the associated command files for use in the relevant application.


1 Software

SPSS is used for data manipulation and for estimation of correspondence analysis models. This is the program for which we provide syntax files.

The estimation of RC models uses LEM, a freely available software package distributed by its author, Jeroen K Vermunt (follow the link 'Software and user manuals').

For much of the assessment and handling of results, and also within certain stages of the data preparation processes, Microsoft Excel is used, since this provides an adequate mainstream method of handling and browsing data blocks.

Finally, a large amount of the data manipulation and reporting is done in simple plain text file editors. A particularly useful editor for PC operating systems is pfe, available freely from its website. Alternative editors are equally capable of allowing for the same manipulations, but we would advise against conducting the equivalent text file editing using word processing packages, such as Microsoft Word. These frequently add characters to files, whilst 'hiding' them from view, a practice which can interfere with procedures that rely on text file processing.


2 Preparing input data

The initial data are likely to take one of two forms: either, a set of individual records showing the occupation (including employment status where available) of the husband and that of the wife; or, a set of aggregated records showing a particular occupation for husbands, a particular occupation for wives and the frequency of that combination. (Cases where the data are at the household level would require a preliminary stage where the two partners are identified and an individual record set up for them.) The second, aggregated form is the one that is used throughout the analyses. Instructions for converting individual into aggregated records are given below in subsection 2.2.

2.1 A schema of variable names

When constructing multiple scale versions for different countries and datasets, it can save a considerable amount of time if the names used for the various occupational units are harmonised between versions. The examples given in these construction pages use a set of variable names that we would encourage other users to adopt. This greatly simplifies the process of construction. Further details.

2.2 Translating from a 'case' to a 'table' file

The 'individual case' file form (ie tabular data of one record each per husband-wife case) is not particularly efficient in terms of storage. A much more useful form (and indeed the only form readily adapted to some of the analyses described below) is a 'table' file where each record consists of a unique combination of husband and wife occupational units, plus an additional 'weighting' variable which indicates the number of individual cases which are found in this combination.

Such table files are readily constructed from the original case files, for instance in SPSS by using the 'casestovars' command (aka 'restructure', available from version 10 onwards), or alternatively by importing output from the 'crosstab' procedure (the older method: see these instructions).
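As a language-neutral illustration of the case-to-table translation, here is a minimal Python sketch. It is not the SPSS syntax supplied on the linked pages; the function name and the example codes are hypothetical.

```python
from collections import Counter


def cases_to_table(cases):
    """Collapse (husband_unit, wife_unit) case records into table records.

    Returns a list of (husband_unit, wife_unit, frequency) tuples, sorted by
    the two occupational unit codes.
    """
    counts = Counter(cases)
    return [(h, w, n) for (h, w), n in sorted(counts.items())]


# Hypothetical couple-level records: (husband code, wife code)
cases = [(110, 320), (110, 320), (451, 320), (110, 110), (451, 320)]
table = cases_to_table(cases)
```

Each output record holds one unique husband-wife combination plus its frequency, which plays the role of the 'weighting' variable described above.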

See sub-section 2.4 below for how to deal with a possible complication.

2.3 Occupational unit 'value labels'

Data are typically supplied with numeric occupational title and status codes, and the scale estimation is carried out on these original codes or on an 'autorecoded' translation of them (see sub-section 2.4). At the same time, documentation exists which translates the numerical codes to their text 'value labels', as for instance in the tables of occupational titles which are listed via the versions page. It is possible to proceed through the CAMSIS estimation working only with the numeric codes, cross-checking by hand on their textual meaning when specifically required, but scale construction is much easier if automated procedures matching numeric values and their value labels are implemented. The recommended way of doing this is to utilise the value labels functions of SPSS and its subsequent output options. This method is slightly cumbersome to set up, but once prepared allows rapid linking between value labels and the estimation results associated with numeric values. Further details.

2.4 'Square autorecoded' values

While the occupational base unit indexes used here (title-only or title-by-status values) have numeric values which are held constant between applications (and are associated with the relevant value labels), those values tend to be numerically high (typically up to 999) and are not usually contiguous (so that, for example, not all of the 999 possible categories are actually used). It is considerably more convenient if values are used that start at 1 and increase incrementally up to the maximum value of the actually used codes. SPSS has a facility to implement this 'autorecoding' procedure. Further details.
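The recoding idea can be sketched in Python as follows. This is an illustration only, not the SPSS facility itself. We assume here that 'square' means a single recoding shared by husbands and wives, so that the recoded table has the same categories on both margins; the function name and codes are hypothetical.

```python
def square_autorecode(table):
    """Map original unit codes to contiguous values 1..k.

    A single mapping is built from the codes used by either partner, so the
    recoded husband-by-wife table has the same categories on both margins.
    """
    used = sorted({h for h, _, _ in table} | {w for _, w, _ in table})
    mapping = {code: i for i, code in enumerate(used, start=1)}
    recoded = [(mapping[h], mapping[w], n) for h, w, n in table]
    return recoded, mapping


# Hypothetical table records: (husband code, wife code, frequency)
table = [(110, 320, 2), (451, 320, 2), (110, 110, 1), (999, 451, 3)]
recoded, mapping = square_autorecode(table)
```

Keeping the mapping alongside the recoded table makes it possible to translate the estimated scores back to the original codes and their value labels.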



3 Preliminary models using CA

Our recommended method is to start the analysis proper with Correspondence Analysis. Results from these models are subsequently used as evaluative tools throughout the rest of the scale development. This stage is particularly valuable in identifying the problematic diagonal elements that are discussed in the overview. Further details.

The final CAMSIS model will almost certainly be subject to further refinements which account for various 'confounding factors' in the social interaction structure. To generate such models, after having set up some preliminary CA models, we generally then follow the sections 4 to 6 below, often looping back through the stages to repeat and refine the models.
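For readers who want to see what a first CA dimension amounts to numerically, the sketch below uses reciprocal averaging, a classic iterative route to the first non-trivial correspondence analysis dimension. It is not the SPSS procedure used by the project, and the small table is invented for illustration.

```python
def first_ca_dimension(table, iterations=200):
    """Approximate first-dimension CA scores by reciprocal averaging.

    `table` is a list of rows of non-negative frequencies (rows = husbands'
    units, columns = wives' units). Every row and column total must be
    positive, and the table must not be exactly independent.
    Returns (row_scores, column_scores): column scores are standardised to
    weighted mean 0 and weighted variance 1, and row scores are the weighted
    averages of those column scores.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(n_cols)]
    total = sum(row_tot)

    def standardise(scores):
        mean = sum(s * t for s, t in zip(scores, col_tot)) / total
        centred = [s - mean for s in scores]
        var = sum(c * c * t for c, t in zip(centred, col_tot)) / total
        return [c / var ** 0.5 for c in centred]

    def row_averages(col):
        return [sum(table[i][j] * col[j] for j in range(n_cols)) / row_tot[i]
                for i in range(n_rows)]

    def col_averages(row):
        return [sum(table[i][j] * row[i] for i in range(n_rows)) / col_tot[j]
                for j in range(n_cols)]

    col = standardise([float(j) for j in range(n_cols)])  # arbitrary start
    for _ in range(iterations):
        col = standardise(col_averages(row_averages(col)))
    return row_averages(col), col


# Invented 3x3 husband-by-wife table with a clear ordering of units
rows, cols = first_ca_dimension([[10, 5, 1], [5, 10, 5], [1, 5, 10]])
```

The sign of the dimension is arbitrary and depends on the starting vector, so only the ordering and spacing of the scores carry meaning.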


4 Aggregation of small groups

We choose on principle to maintain the maximum level of differentiation between occupational base units, but survey samples or census sub-samples typically involve a number of occupational units that are represented by very few cases. To avoid problems of unrepresentative sampling, therefore, a next step in almost all cases is the revision of the initial data to deal with the most sparsely represented occupational units (also see notes from overview page). This involves making decisions about merging sparsely represented units, either with other unit groups which are already very well represented, or else with a number of other sparsely represented units until the resultant aggregation is sufficiently large. Where, as in most cases, there is a distinction in terms of employment status as well as occupation, this also involves deciding which to choose as the basis for aggregation (ie, whether to preserve employment status and merge across title groups, or to preserve title group and merge between employment status categories). Further details.
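The mechanical part of this step can be sketched as below: flag sparsely represented units, then apply merging decisions that the researcher has made. The threshold of 30 cases and all names are illustrative assumptions, not project rules; the substantive choice of which units to merge remains a judgement.

```python
def unit_totals(table):
    """Total cases per unit, counted separately for husbands and wives."""
    husbands, wives = {}, {}
    for h, w, n in table:
        husbands[h] = husbands.get(h, 0) + n
        wives[w] = wives.get(w, 0) + n
    return husbands, wives


def sparse_units(table, min_cases=30):
    """Units with fewer than `min_cases` husbands or fewer than `min_cases` wives."""
    husbands, wives = unit_totals(table)
    units = set(husbands) | set(wives)
    return sorted(u for u in units
                  if husbands.get(u, 0) < min_cases or wives.get(u, 0) < min_cases)


def apply_merges(table, merge_map):
    """Recode units according to `merge_map` and re-aggregate the frequencies."""
    merged = {}
    for h, w, n in table:
        key = (merge_map.get(h, h), merge_map.get(w, w))
        merged[key] = merged.get(key, 0) + n
    return [(h, w, n) for (h, w), n in sorted(merged.items())]


# Hypothetical table: unit 3 is sparse and is merged into unit 2
table = [(1, 1, 40), (1, 2, 35), (2, 1, 50), (3, 2, 4), (2, 3, 2)]
flagged = sparse_units(table)
merged_table = apply_merges(table, {3: 2})
```

Recording the merge map is worthwhile, since the same map is needed later when scores are attributed back to the merged units.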


5 Statistical analyses

The two basic forms of analysis that create the CAMSIS measures are, as already indicated, correspondence analysis (CA), using SPSS, and log-linear analysis with row-column association models (RC), using LEM. The former has already been introduced in section 3 above on preliminary models. The major modification needed for this stage is to deal with diagonal and pseudo-diagonal cells. This also needs to be done in the RC modelling approach, which can also take account of subsidiary dimensions.
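For orientation, a one-dimensional RC association model with terms for designated cells can be written in a standard form as below. The notation is ours, and the exact parameterisation in the LEM command files may differ.

```latex
\log F_{ij} = \lambda + \lambda^{H}_{i} + \lambda^{W}_{j}
            + \phi\,\mu_{i}\,\nu_{j} + \delta_{ij}\,d_{ij}
```

Here $F_{ij}$ is the expected frequency of couples with husband in unit $i$ and wife in unit $j$, $\mu_i$ and $\nu_j$ are the estimated row and column scores, $\phi$ is the overall association parameter, and $d_{ij}$ equals 1 for cells designated as diagonal or pseudo-diagonal and 0 otherwise.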

5.0 Diagonals and pseudo-diagonals

The general problem is discussed in the overview. We wish to cater for particular pseudo-diagonal cases in the CAMSIS models because, if we do not, the relatively trivial patterns of association affecting these cases come to dominate the overall patterns affecting the occupational unit. The result is occupational scales with certain extreme values which primarily represent the probability of being in the pseudo-diagonal combination against all other situations. We deal with pseudo-diagonal combinations by identifying the particular husband-wife occupational unit combination, then building an account for it into our association model. To identify the relevant pseudo-diagonal units, we successively review the results of previous models for outlying scale scores or combination residuals, sometimes combining these results with some substantive knowledge on likely pseudo-diagonals. We then readjust the models at each stage with the latest set of pseudo-diagonal combinations, until we reach a stage where we believe that the influence of pseudo-diagonality patterns on the derived scale scores has been virtually removed. The specific ways in which pseudo-diagonality is dealt with in SPSS and LEM are discussed in sections 5.1.1 and 5.2.1 below.
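One simple way to screen for candidate pseudo-diagonals is to look for cells that are heavily over-represented. The sketch below computes Pearson residuals under plain independence, which is a simplification: the project reviews residuals and scores from the fitted CA or RC models. The threshold and example codes are illustrative assumptions.

```python
def pearson_residuals(table):
    """Pearson residual of each observed cell under row-column independence."""
    row_tot, col_tot, total = {}, {}, 0
    for h, w, n in table:
        row_tot[h] = row_tot.get(h, 0) + n
        col_tot[w] = col_tot.get(w, 0) + n
        total += n
    residuals = {}
    for h, w, n in table:
        expected = row_tot[h] * col_tot[w] / total
        residuals[(h, w)] = (n - expected) / expected ** 0.5
    return residuals


def candidate_pseudo_diagonals(table, threshold=6.0):
    """Off-diagonal cells whose residual exceeds `threshold`, largest first."""
    flagged = [(cell, r) for cell, r in pearson_residuals(table).items()
               if r > threshold and cell[0] != cell[1]]
    return sorted(flagged, key=lambda item: -item[1])


# Hypothetical table in which combination (3, 4) is strongly over-represented
table = [(1, 1, 50), (1, 2, 10), (2, 1, 10), (2, 2, 50),
         (3, 4, 30), (3, 1, 5), (1, 4, 5)]
candidates = candidate_pseudo_diagonals(table)
```

Cells flagged in this way are only candidates; substantive knowledge of the occupations involved should decide whether a combination is treated as a pseudo-diagonal.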

5.1 Correspondence analysis using SPSS

The model estimation in SPSS is exactly the same as that given earlier for the preliminary models, with the only exception being that as model development proceeds, successively more pseudo-diagonal combinations are added and so excluded from the CA estimation process.

5.1.1 Handling pseudo-diagonals in SPSS

The handling of pseudo-diagonals is relatively straightforward in the CA approach, where we simply exclude the relevant husband-wife combinations from the analysis. Further details.
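In table-file terms, the exclusion amounts to dropping the relevant records before estimation. A minimal Python sketch follows; the names are hypothetical, and the option to drop true diagonal cells as well is our assumption, offered as a switch.

```python
def exclude_combinations(table, pseudo_diagonals, drop_diagonal=True):
    """Remove designated husband-wife combinations from a table file.

    `pseudo_diagonals` is an iterable of (husband_unit, wife_unit) pairs.
    If `drop_diagonal` is true, cells where both partners share the same
    unit are removed as well.
    """
    excluded = set(pseudo_diagonals)
    return [(h, w, n) for h, w, n in table
            if (h, w) not in excluded and not (drop_diagonal and h == w)]


# Hypothetical table with one pseudo-diagonal combination, (3, 4)
table = [(1, 1, 50), (1, 2, 10), (3, 4, 30), (3, 1, 5)]
reduced = exclude_combinations(table, [(3, 4)])
kept_diagonal = exclude_combinations(table, [(3, 4)], drop_diagonal=False)
```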

5.2 Row-column association modelling using LEM

The main 'RC' analyses used by the CAMSIS project are carried out using the program LEM to model the cell frequencies. A major advantage is that this allows the use of standard statistical criteria for evaluating the fit of the models. The estimation of RC models in LEM requires the specification of LEM command files for the chosen model structure. Details of all the options involved can be found in the LEM manual, which, along with the programme itself, has been made freely available over the internet by the programme's author, Jeroen K Vermunt.

Depending on the size of the data file analysed and the power of the processing machine, RC models in LEM can be very quick (less than 1 minute) or very slow (more than 1 day) to converge (although the convergence criteria can of course be changed). Additionally, there is often more than one LEM model which could be evaluated at any one time: for instance with or without subsidiary dimensions, or with different coverage of pseudo-diagonals (and in fact, not discussed here, using alternative statistical treatments of the pseudo-diagonal combinations). The easiest way of estimating LEM models is to run a few models in sequence, often for several hours or even days, when a computer is not needed for anything else. This can be achieved by creating multiple command files and the appropriate data and design matrices, then calling the command files sequentially from an MSDOS batch file.
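The batch file itself is just one command line per model. The sketch below generates such lines from a list of model names. Note that "RUN_LEM" and the ".inp"/".out" file extensions are placeholders of ours, not the real LEM invocation, which should be taken from the LEM manual.

```python
def batch_lines(model_names, command_template):
    """One command line per model, built from a user-supplied template.

    `command_template` must contain the placeholder `{name}`.
    """
    return [command_template.format(name=name) for name in model_names]


# Hypothetical model names; "RUN_LEM" stands in for the real command
lines = batch_lines(["rc_base", "rc_pseudo", "rc_subdim"],
                    "RUN_LEM {name}.inp {name}.out")
script = "\r\n".join(lines) + "\r\n"  # DOS-style line endings for a batch file
```

Writing `script` to a batch file and running it would execute the models one after another, so that a long sequence can be left to run unattended.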

The accompanying file includes the specification of a typical model for input to LEM. Note, however, that it includes features discussed in sections 5.2.1 and 5.2.2, so it is important to read those first.

The remainder of the scale construction then becomes a successive loop: estimating models, checking for pseudo-diagonals, if necessary revising the data with further category collapses, and then re-estimating a revised model.

5.2.1 Handling pseudo-diagonals in LEM

A further advantage of LEM's RC model facilities is that they offer a more satisfactory way of dealing with pseudo-diagonals (as specific husband-wife combinations) and, as an extension of that, with subsidiary dimensions (see the following section). However, the specification of pseudo-diagonals for an RC model in LEM is significantly more complicated than it is for CA in SPSS. Further details.

5.2.2 Subsidiary dimensions in LEM

As well as identifying specific pseudo-diagonal husband-wife occupational unit combinations, there are also circumstances where it is efficient to construct whole 'subsidiary dimensions' of social association between occupations, which reflect a secondary factor that influences association likelihoods, but is separable from a generalised interaction / stratification primary dimension.

Such subsidiary dimensional structures cannot be readily added to the SPSS CA estimation structure, and can only be added to the LEM RC models if they form separable subgroups to the base occupational unit. The major examples are the employment status subgroups of the title-by-status base units, and the major groups (or sectoral cleavages denoted by them) of the occupational schema. Further details.


6 Assessing the results of CA and RC models

6.1 Exporting the results

After each iteration in the CA and RC analyses it will be necessary to assess the results to determine whether a satisfactory solution has been achieved, or what changes may need to be made. These notes provide details of how the output can be transferred to other programs.

6.2 Assessing the results

After a given model has been estimated, our principal concern has usually been to check that an adequate account of pseudo-diagonal combinations has been used, and adjust the model if not. The methods we have used for this are covered in the immediately preceding section.

In addition, there are usually two further ways we evaluate interim and final CAMSIS model results. First, we make regular substantive reviews of the derived occupational unit scale scores. This can be achieved most readily by ranking the exported derived scores alongside their occupational base unit titles, and reviewing the structures derived. Aside from being of general interest, this method is also often the quickest way of telling us whether there are any subsidiary dimension structures worth accounting for in the derived model (as we may see that the derived score orders for the single-dimensional models appear to be clustered around factors which could comprise a subsidiary dimension).

A second method of review is to examine the aggregate statistics for the interim and final models produced. With SPSS CA, relatively few such statistics can be generated, but perhaps the most significant concerns the relative size of the singular value associated with the first dimension (the larger it is compared to other dimensions, the more consistently we can expect to have represented a primary dimension of social interaction / stratification). With LEM RC models, on the other hand, a wealth of aggregate statistics can be assessed. Dimension association statistics closely parallel CA singular values in indicating the relative influence of the first dimension of generalised interaction / stratification. Additionally, a number of model fit statistics allow us to compare how well different models have described the patterns of husband-wife association found in the data. Thus models with differing accounts of both pseudo-diagonals and subsidiary dimensions can be compared with each other, and the results used to adjudicate on the most appropriate formulation for the final LEM model.
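The 'relative size' of the first singular value can be made concrete by converting singular values to shares of total inertia, using the standard CA result that the inertia of a dimension is its singular value squared. The values below are invented for illustration.

```python
def inertia_shares(singular_values):
    """Share of total inertia attributable to each CA dimension."""
    inertias = [s * s for s in singular_values]
    total = sum(inertias)
    return [i / total for i in inertias]


# Invented singular values for the first three dimensions
shares = inertia_shares([0.6, 0.3, 0.2])
```

In this invented case the first dimension accounts for roughly 73 per cent of the inertia across the three dimensions supplied.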

In ongoing CAMSIS project work we are still addressing issues concerning the assessment of alternative model statistics and substantive structures, and more details will hopefully be supplied shortly. For an early example, see the project report on the Swiss scale constructions, which contains some practical examples of such discussions.


6.3 Descriptive statistics and graphical displays

When the model estimation is complete, we derive descriptive statistics and graphical displays to summarise the estimated score distributions. These are useful in giving an idea of the nature of the occupational structure. In particular, the shape of these distributions is significant because it can indicate whether the location of the occupations forms a smooth hierarchy or is more akin to a discrete categorical structure. Further details.


7 Scores

By the end of the section above, we have reached the stage where derived CAMSIS scale scores have been exported to data files which link them with the relevant occupational unit categories, plus some information on the number of cases involved in each unit. In the following sections, we describe the preparation for dissemination of those scores.

7.1 Score transformations

The initial scale construction processes generate occupational unit scores that are scaled around the (default) parameters of the model and are numerically small. We prefer to transform them to a more practically useful form. In part, this simply involves making them all positive and multiplying them up to a more amenable magnitude and precision. However, there is also an issue of how scores for different countries, and for men and women, can best be made comparable with one another. Further details.
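A linear rescaling of this kind can be sketched as follows. The target mean of 50 and standard deviation of 15, weighted by the number of cases in each unit, are illustrative assumptions in this sketch; the transformation actually applied is described on the linked page.

```python
def rescale_scores(scores, weights, target_mean=50.0, target_sd=15.0):
    """Linearly rescale unit scores to a chosen weighted mean and SD.

    `scores` maps unit -> raw model score; `weights` maps unit -> number of
    cases in that unit.
    """
    total = sum(weights[u] for u in scores)
    mean = sum(scores[u] * weights[u] for u in scores) / total
    var = sum(weights[u] * (scores[u] - mean) ** 2 for u in scores) / total
    sd = var ** 0.5
    return {u: target_mean + target_sd * (s - mean) / sd
            for u, s in scores.items()}


# Hypothetical raw scores and case counts for three units
raw = {1: -0.5, 2: 0.0, 3: 0.5}
cases = {1: 100, 2: 200, 3: 100}
rescaled = rescale_scores(raw, cases)
```

A rescaling like this does not by itself guarantee positive values for extreme units, and it leaves open the comparability questions between countries and between men and women raised above.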

7.2 Producing an index file

A last significant step is the transformation of the score values into 'index' files associated with each possible occupational base unit. This process takes the form of three steps. First, an index file needs to be generated which covers every possible occupational base unit category (7.2.1). Second, the appropriate derived scores need to be assigned to those occupational units that were earlier merged in the data construction stage (7.2.2). Finally, occupational units that were not represented by any cases in the original data (but which might occur in other datasets) need to be given an imputed scale value, using closely related units (7.2.3).

7.2.1 Generating the file: further details.

7.2.2 Attributing scores: further details.

7.2.3 Imputing scores: further details.
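The three steps can be sketched together in Python. This is an illustration under stated assumptions: the function and argument names are hypothetical, and imputing by the simple mean of related units is our simplification of 'using closely related units'.

```python
def build_index(all_units, estimated, merge_map, related):
    """Build an index of scores for every possible occupational base unit.

    all_units: every possible unit code (7.2.1)
    estimated: score for each unit that entered the final model
    merge_map: unit -> unit it was merged into during data preparation (7.2.2)
    related:   unit -> list of closely related units to impute from (7.2.3)
    """
    index = {}
    for unit in all_units:
        target = merge_map.get(unit, unit)
        if target in estimated:
            index[unit] = estimated[target]
    known = dict(index)  # impute only from estimated or attributed scores
    for unit in all_units:
        if unit not in index:
            donors = [known[r] for r in related.get(unit, []) if r in known]
            index[unit] = sum(donors) / len(donors) if donors else None
    return index


# Hypothetical codes: 111 was merged into 110; 121 and 130 had no cases
index = build_index(
    all_units=[110, 111, 120, 121, 130],
    estimated={110: 62.0, 120: 48.0},
    merge_map={111: 110},
    related={121: [120], 130: [110, 120]},
)
```

Units with no related units supplied are left as `None`, so that gaps stay visible rather than being filled silently.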


7.3 Disseminating index files

Once a satisfactory 'index' file has been generated, with a CAMSIS estimated score value for every occupational unit, it should be readily possible to use this file to assign CAMSIS scores to appropriate occupational units on other data sources. This section of the 'usage guidelines' page shows SPSS syntax, for instance, that can be used to match original occupational base unit data with the relevant CAMSIS derived scores.
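The matching itself is a simple lookup on the occupational unit code. A minimal Python sketch follows; it is not the SPSS syntax from the usage guidelines, and the field names are hypothetical.

```python
def attach_scores(records, index, unit_field="occ_unit", score_field="camsis"):
    """Return copies of `records` with the CAMSIS score for each unit added.

    Records whose unit code is missing from the index receive None.
    """
    return [dict(rec, **{score_field: index.get(rec[unit_field])})
            for rec in records]


# Hypothetical survey records and index file contents
records = [{"id": 1, "occ_unit": 110}, {"id": 2, "occ_unit": 999}]
scored = attach_scores(records, {110: 62.0})
```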

Finally, as any user who has struggled through the full length of the details above will realise, there are many potential complexities to any given CAMSIS scale construction. Most of these can be regarded as to some extent 'optional': different version constructions may incorporate different complexities to a greater or lesser extent, with no necessary precondition that the most comprehensively derived set of scores will be the most satisfactory (although it is more likely than not!). Moreover, several of these possible complexities will reflect the input of individual researchers (for instance, choices in the initial data merging options, the choice of pseudo-diagonals, and the choice of final model using RC techniques). To promote good practice in the dissemination process, therefore, we would encourage any users who estimate and make available CAMSIS scale scores to provide as much detail as possible on those aspects of the scale constructions.




Last modified 31 May 2004
This document is maintained by Paul Lambert