****  PAUL LAMBERT, UNIV. STIRLING, 29 MAR 2011

*** Stata code for derivation of CAMSIS scale scores using data on pairs of socially connected occupations 
**    using correspondence analysis 
*** 

** ALGORITHM FOR AUTOMATED SCALE GENERATION FOR ISCO88 4-DIGIT UNITS 

** Notes below  on data input and format requirements; and programme arguments including local and global arguments. 


*************************************************************
*************************************************************
*************************************************************



**************************************************************
**** DATA INPUT AND FORMAT REQUIREMENTS

** [1] Input microdata (PATHC): a rectangular Stata format file featuring data on 
*          two socially connected occupations, variable names to be given as programme arguments 
*          (example: husbands occupation; wive's occupation). The data may be unweighted, or 
*          may have integer frequency weights in a named variable (which need be specified as a programme argument)
*         Up to three variables are saved from this file (the two occupation variables, and the frequency weight 
*          indicator if it exists. Any name can be used for these variables, except for 'cfreq' which is used 
*          in the programme 

** [2] Output file name (PATHD): the programme will create an output file with this name 
*      with occupations listed alongside the recommended scale scores 
*      (User notice: In most instances, don't expect the first run of the programme to generate useful 
*        scale scores, there will probably be a need for some iterative analysis identifying and excluding 
*        'pseudo-diagonal' occupational combinations)

** [3] Pseudo-diagonal indicator file (PATHB)
*      This is a file which lists pairs of occupational combinations to be treated as 'pseudo-diagonals'.
*      This file must be tab-delimited with four columns and column headers named respectively 
*      lower_1 upper_1 lower_2 upper_2 
*      The nature of this file and its preparation (for instance via MS Excel) is explained at 
*      www.camsis.stir.ac.uk/make_camsis 
*      If you have not specified any such pairs, it is still necessary to specify this file. A file with missing 
*      category codes can be used (the net effect being that no pairs are treated as pseudo-diagonals, 
*      an example is available at http://www.camsis.stir.ac.uk/make_camsis/templates/camsis_psds_blank.txt, so 
*      a sensible default is to specify: 
*        ' global pathb "http://www.camsis.stir.ac.uk/make_camsis/templates/camsis_psds_blank.txt" ' 
*

** [4] [Not needed here] Occ codes with an expected correlate ('PATHA')
*       For information, in other automated macros this file is used to help decide the direction of the 
*         scale; for ISCO88, there is no need for this since we expect a negative correlation with isco88 values




** [5] OCC template file (PATHE) 
*      This is a database of every possible individual occ in the scheme used,  
*      For ISCO88, a generic template is available at the CAMSIS site: 
*      http://www.camsis.stir.ac.uk/make_camsis/templates/isco88templateoccs4.dat


* 

** [6] Output microdata (PATHF): this will be generated by the programme, and will comprise a 
*          version of the input microdata, but now with additional information on the recoded occupational 
*          units on which the scale itself is derived (the natural units of the two occupational 
*          categories, say 'hocc' and 'wocc',  are recoded into new variables 'occ1s' and 'occ2s' which usually 
*          have some differences in coding due to sparse categories from the former being merged in the latter). 



**************************************************************
**** PATH REQUIREMENTS


**************************************************************
**************************************************************
**** PROGRAMME ARGUMENTS


********  1) Programme arguments: 

** (To be specified whilst calling the programme)

***  Five arguments are used, denoted in the programme below as:  occ1 occ2 score weight digit 


** The first argument is the name of the variable for the first occupation in the pair (e.g. 'hocc' for husband's occupation)

** The second is the name of the variable for the second occupation in the pair (e.g. 'wocc' for wife's occupation)

** The third is the 'stump' for the name of the new variable(s) with derived CAMSIS scale scores

** The fourth is the name or value of a frequency weight variable, if one exists
**    (specify as 1 otherwise; default equals 1 only if omitted and if 'digit' is also omitted)

** The fifth is a detail on the number of digits less than the full number of digits to be used to define 
**     diagonals (specify as 0 if unsure; common values are zero or 1; 
**                  can be omitted only if weight is given, in which case defaults to zero) 

** EXAMPLES:
 
** If the estimation program name was ca_mod1, the two occupation variables were socc and focc (son and fathers occ), 
**  we wanted to call the new variable 'usa_cam1', we had a micro-data file with no weighting, and we were using 
**  four digit US SOC 2000 units in which we treat all father-son combinations within the same three-digit subgroups 
**   as diagonals, we would call the programme as: 
******   is88_ca socc focc usa_cam1 1 1 

** If the estimation program name was ca_mod1, the two occupation variables were hjob and wjob (hubsbands and wifes occ), 
**  we wanted to call the new variable 'rom_cs', we had a micro-data file with frequency weights in the variable 'freq', 
**  and we were using four digit ISCO88 units in which we treat husband-wife combinations within the same occupation 
**  as diagonals, we would call the programme as: 
******   is88_ca hjob wjob rom_cs freq 0 










****** 3) Additional terms employed by the program

** The program also creates numerous specifically named new variables and data files within its estimation process,  
*   most of which need not be preserved. 




**  The additional global argument 'occtype' is used to denote a name for the occupational unit group to which the 
*  occupational scores are being derived, such as hisco, isco88, soc2000, etc. 
* (Cf http://www.geode.stir.ac.uk/ougs.html).
* EXAMPLE: For an analysis on US SOC 2000 OUGS, declare before analysis that 'global occtype "us_soc2000" ' 


**************************************************************
**************************************************************









**************************************************************




**************************************************************

 

** COMMENTS ON PROGRAMME : 

* Opens data
* Recodes categories in response to sparse unit groups
* Includes first CA
* Excludes extreme residuals
* Re-runs CA, then evaluation of dim1 and dim2,  
* Standardises row and column scores in response to CA results 




** The following variables are generated within the analysis below; 
*   if variables with these names already exist on your data, they could be overwritten: 
* hiscv01 dimused half cfreq freq2 m1fit n_comb caresid psd2 rawsc1 rawsc2 corelv sign


**************************************************************
**************************************************************














**************************************************************
**************************************************************


**************************************************************
**** PROGRAMME SPECIFICATION


capture program drop is88_ca

program define is88_ca 

args occ1 occ2 score weight digit 


*******************************


*******************************

** Data construction (0): Generate fallback occupations for men and for women
* (the fall-backs are the most populous occupation in the group)

* Men
use $pathc, clear
capture drop cfreq
gen cfreq=1
capture replace cfreq=`weight'
keep `occ1' cfreq 
gen isco1=floor(`occ1' /1000)
gen isco2=floor(`occ1' /100)
gen isco3=floor(`occ1' /10)
gen isco4=`occ1'

* Find the most common occs within the 1,2,3,4 digit units
* 1
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco1)
egen rprop=sum(cfreq), by(`occ1') 
replace rprop=rprop / rcount 
gsort +isco1 -rprop
capture drop tocc
gen tocc=`occ1'
replace tocc=0 if (isco1[_n-1]==isco1 )
egen mod1_m=max(tocc), by(isco1)
* 2 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco2)
egen rprop=sum(cfreq), by(`occ1') 
replace rprop=rprop / rcount 
gsort +isco2  -rprop
capture drop tocc
gen tocc=`occ1'
replace tocc=0 if (isco2[_n-1]==isco2 )
egen mod2_m=max(tocc), by(isco2 )
* 3 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco3)
egen rprop=sum(cfreq), by(`occ1') 
replace rprop=rprop / rcount 
gsort +isco3  -rprop
capture drop tocc
gen tocc=`occ1'
replace tocc=0 if (isco3[_n-1]==isco3 )
egen mod3_m=max(tocc), by(isco3 )
* 4 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco4 )
egen rprop=sum(cfreq), by(`occ1') 
replace rprop=rprop / rcount 
gsort +isco4  -rprop
capture drop tocc
gen tocc=`occ1'
replace tocc=0 if (isco4[_n-1]==isco4 )
egen mod4_m=max(tocc), by(isco4 )
* 
codebook `occ1'  mod1_m mod2_m mod3_m mod4_m , compact
keep `occ1'  mod1_m mod2_m mod3_m mod4_m 
sort `occ1'
egen tagi=tag(`occ1')
tab tagi
keep if tagi==1
drop tagi
summarize
sav $path9\males_recodes.dta, replace
* (This is a matrix with the original occ measure, and recodes at higher levels of aggregation)



* Women
use $pathc, clear
capture drop cfreq
gen cfreq=1
capture replace cfreq=`weight'
keep `occ2' cfreq 
gen isco1=floor(`occ2' /1000)
gen isco2=floor(`occ2' /100)
gen isco3=floor(`occ2' /10)
gen isco4=`occ2' 
* Find the most common occs within the 1,2,3,4 digit units 


* 1 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco1)
egen rprop=sum(cfreq), by(`occ2') 
replace rprop=rprop / rcount 
gsort +isco1 -rprop
capture drop tocc
gen tocc=`occ2'
replace tocc=0 if (isco1[_n-1]==isco1 )
egen mod1_f=max(tocc), by(isco1 )
* 2 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco2 )
egen rprop=sum(cfreq), by(`occ2') 
replace rprop=rprop / rcount 
gsort +isco2  -rprop
capture drop tocc
gen tocc=`occ2'
replace tocc=0 if (isco2[_n-1]==isco2 )
egen mod2_f=max(tocc), by(isco2 )
* 3 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco3 )
egen rprop=sum(cfreq), by(`occ2') 
replace rprop=rprop / rcount 
gsort +isco3  -rprop
capture drop tocc
gen tocc=`occ2'
replace tocc=0 if (isco3[_n-1]==isco3 )
egen mod3_f=max(tocc), by(isco3 )
* 4 
capture drop rcount
capture drop rprop
egen rcount=sum(cfreq), by(isco4 )
egen rprop=sum(cfreq), by(`occ2') 
replace rprop=rprop / rcount 
gsort +isco4  -rprop
capture drop tocc
gen tocc=`occ2'
replace tocc=0 if (isco4[_n-1]==isco4 )
egen mod4_f=max(tocc), by(isco4 )
* 
codebook `occ2'  mod1_f mod2_f mod3_f mod4_f , compact
keep `occ2' mod1_f mod2_f mod3_f mod4_f 
sort `occ2'
egen tagi=tag(`occ2')
tab tagi
keep if tagi==1
drop tagi
summarize
sav $path9\females_recodes.dta, replace
* (This is a matrix with the original occ measure, and recodes at higher levels of aggregation)
*
dir $path9\*_recodes.dta 
**



*********************************************


** Data construction (i): convert microdata into 'table' format

use $pathc, clear
capture drop cfreq
gen cfreq=1
capture replace cfreq=`weight'
keep `occ1' `occ2' cfreq 
sort `occ1' `occ2' 
collapse (sum) cfreq, by(`occ1' `occ2')
summarize
sav $path9\temp1.dta, replace




*********************************************


** Data construction (ii): Exclude any diagonal pairs according to digit specifier 

use $path9\temp1.dta, clear
capture drop cs_dig
gen cs_dig=0
capture replace cs_dig=`digit'
tab cs_dig
capture drop cs_dig2
gen cs_dig2=10^cs_dig
capture drop cs_d_1
gen cs_d_1=floor(`occ1'/cs_dig2)
capture drop cs_d_2
gen cs_d_2=floor(`occ2'/cs_dig2)
capture drop psd1
gen psd1=(cs_d_1==cs_d_2)
tab psd1 [fw=cfreq]
sort `occ1' `occ2' 
sav $path9\temp2.dta, replace



*********************************************

** Data construction (iii): Exclude any diagonals according to specific combinations specified

** Exclude pseudo-diagonals
insheet using $pathb, clear
gen psd_n=_n
summarize
sav $path9\m1.dta, replace

use $path9\temp2.dta, clear
gen lower_1=`occ1'
gen upper_1=`occ1'
gen lower_2=`occ2'
gen upper_2=`occ2'
append using $path9\m1.dta
summarize

capture drop psd2
gen psd2=0
summarize psd_n
global psd_nm=r(max)
di $psd_nm
forvalues i =1(1)$psd_nm {
   capture scalar drop l1 l2 u1 u2
   quietly summarize lower_1 if psd_n==`i'
   scalar l1=r(mean)
   quietly summarize lower_2 if psd_n==`i'
   scalar l2=r(mean)
   quietly summarize upper_1 if psd_n==`i'
   scalar u1=r(mean)
   quietly summarize upper_2 if psd_n==`i'
   scalar u2=r(mean)
   replace psd2=1 if lower_1 >= l1 & upper_1 <= u1 & lower_2 >= l2 & upper_2 <= u2
    } 

drop if psd_n >= 1 & psd_n <= $psd_nm
* (Drops the first rows of the data, ie those with valid psd_n values, leaving microdata plus psd2 markers)

tab psd2 [fw=cfreq]
tab psd1 psd2 [fw=cfreq]

sav $path9\temp3.dta, replace




*********************************************


** Data construction (iv): Recode sparse categories after making exclusions 


use $path9\temp3.dta, clear
gen cfreq2=cfreq*(psd1==0 & psd2==0)



** Merges in the recode files for men and for women: 
sort `occ1'
merge `occ1' using $path9\males_recodes.dta
keep if _merge==1 | _merge==3
tab _merge
drop _merge
sort `occ2'
merge `occ2' using $path9\females_recodes.dta
keep if _merge==1 | _merge==3
tab _merge
drop _merge
**




** 4digit level recodes (usually redundant) 

sort `occ1'
capture drop n_occ1
egen n_occ1=sum(cfreq2), by(`occ1')
sort `occ2'
capture drop n_occ2
egen n_occ2=sum(cfreq2), by(`occ2')

capture drop occ1s 
capture drop occ2s 
gen occ1s=`occ1'
gen occ2s=`occ2'
replace occ1s=mod4_m if n_occ1 <= 30
replace occ2s=mod4_f if n_occ2 <= 30



** 3 digit level recodes if still needed 

sort occ1s
capture drop n_occ1
egen n_occ1=sum(cfreq2), by(occ1s)
sort occ2s
capture drop n_occ2
egen n_occ2=sum(cfreq2), by(occ2s)
replace occ1s=mod3_m if n_occ1 <= 30
replace occ2s=mod3_f if n_occ2 <= 30



** 2 digit level recodes if still needed 

sort occ1s
capture drop n_occ1
egen n_occ1=sum(cfreq2), by(occ1s)
sort occ2s
capture drop n_occ2
egen n_occ2=sum(cfreq2), by(occ2s)
replace occ1s=mod2_m if n_occ1 <= 30
replace occ2s=mod2_f if n_occ2 <= 30



** 1 digit level recodes if still needed 

sort occ1s
capture drop n_occ1
egen n_occ1=sum(cfreq2), by(occ1s)
sort occ2s
capture drop n_occ2
egen n_occ2=sum(cfreq2), by(occ2s)
replace occ1s=mod1_m if n_occ1 <= 30
replace occ2s=mod1_f if n_occ2 <= 30





capture log close
capture log using $path9\`score'_freqs1.txt, replace text
** The recoded occupational categories used in CA analysis, non-psd cases only
tab1 occ1s occ2s [fw=cfreq2]
capture log close


** 


sav $path9\temp4.dta, replace
keep if psd1==0 & psd2==0
summarize `occ1' occ1s `occ2' occ2s
summarize `occ1' [fw=cfreq]
scalar usedN = r(N)
di usedN
sav $path9\temp5.dta, replace


** Summary table file with original and recoded occupations, psd indicators, and frequency weights:
use $path9\temp4.dta, clear
summarize `occ1' occ1s `occ2' occ2s psd1 psd2 cfreq
keep `occ1' occ1s `occ2' occ2s psd1 psd2 cfreq
sav $pathf, replace




** Summary data on number of cases: 
use $path9\temp4.dta, clear
gen cases_1=1
collapse (sum) cases_1 [fw=cfreq], by(`occ1')
sort `occ1'
sav $path9\o1.dta, replace
use $path9\temp4.dta, clear
gen cases_2=1
collapse (sum) cases_2 [fw=cfreq], by(`occ2')
rename `occ2' `occ1'
sort `occ1'
sav $path9\o2.dta, replace







*********************




** Data construction (x): Expand data twofold in order to force rows and columns to be equal
***  [WON'T BE IMPLEMENTED IN THIS EXAMPLE]  
use $path9\temp5.dta, clear
keep `occ1' occ1s `occ2' occ2s cfreq 
gen half=1
sav $path9\bit1.dta, replace
capture drop temp
rename `occ1'  temp
rename `occ2' `occ1' 
rename temp `occ2' 
rename occ1s  temp
rename occ2s occ1s 
rename temp occ2s 
recode half 1=2
sav $path9\bit2.dta, replace
use $path9\bit1.dta, clear
append using $path9\bit2.dta
tab half
gen freq2=floor( (cfreq+1) /2)
summarize
sav $path9\temp6.dta, replace
*** [NOT USED IN THIS DERIVATION]


*****************************************


use $path9\temp5.dta, clear
keep `occ1' occ1s `occ2' occ2s cfreq 
gen freq2=cfreq


** First CA
capture log close
capture log using $path9\`score'_ca1.txt, replace text
table occ1s, c(sum freq2 min `occ1' max `occ1')
table occ2s, c(sum freq2 min `occ2' max `occ2')
ca occ1s occ2s  [fweight=freq2], dim(2)
capture log close


* 1 round of possible psds leading to a second CA excluding the extreme cases: 
capture drop m1fit
predict m1fit, fit 
sort `occ1' `occ2' 
capture drop n_comb
egen n_comb=sum(freq2), by(`occ1' `occ2')
regress n_comb m1fit [fweight=freq2]
capture drop caresid
predict caresid, rstandard
capture drop psd2
gen psd2=0
replace psd2=1 if (caresid > 5 | caresid  < -5)
tab psd2 [fweight=freq2]
** Second CA 
ca `occ1' `occ2' [fweight=freq2] if psd2==0, dim(2) 



*** Extract scores associatd with occ1s and occ2s respectively
capture drop rawsc1
capture drop rawsc2
capture drop rawsc3
capture drop rawsc4
predict rawsc1, rowscore(1)
predict rawsc2, rowscore(2)
predict rawsc3, colscore(1)
predict rawsc4, colscore(2)
sav $path9\ca.dta, replace


****** occ1s: the scores for the rows (mens occupations)
use $path9\ca.dta, clear 
sort `occ1' 
correlate rawsc1 rawsc2 rawsc3 rawsc4 occ1s `occ1' 
correlate rawsc1 `occ1' 
capture scalar drop c1
scalar c1=r(rho)
correlate rawsc2 `occ1'
capture scalar drop c2
scalar c2=r(rho)
capture drop dimused
gen dimused=1
replace dimused=2 if ( (c2*c2)>(c1*c1) )
tab dimused
gen rawsc=rawsc1
replace rawsc=rawsc2 if (dimused==2)
gen rawsc_2=rawsc2
replace rawsc_2=rawsc1 if (dimused==2)
capture drop dimused
sav $path9\`score'_row.dta, replace
capture drop rawsc1
capture drop rawsc2 
* Tool to convert dimension score sign if necessary
correlate rawsc `occ1'
capture scalar drop corel /* If the sign is negative, that's correct already; if it's positive, we want to reverse it */
scalar corel=r(rho)
gen corelv=corel 
gen sign=1
replace sign=-1 if corelv > 0
rename rawsc temp
gen rawsc=sign*temp
drop temp
correlate rawsc_2 `occ1'
capture scalar drop corel2
scalar corel2=r(rho)
gen corelv2=corel2
gen sign2=1
replace sign2=-1 if corelv2 > 0
rename rawsc_2 temp
gen rawsc_2=sign2*temp
drop temp
summarize occ1s rawsc rawsc_2   
keep occ1s rawsc rawsc_2  
sort occ1s
collapse (mean) rawsc rawsc_2 , by(occ1s) 
summarize
sort occ1s
sav $path9\scores_a.dta, replace


****** occ2s: the scores for the columns (womens occupations)
use $path9\ca.dta, clear 
sort `occ2' 
correlate rawsc3 rawsc4 occ2s `occ2'
correlate rawsc3 `occ2'
capture scalar drop c1
scalar c1=r(rho)
correlate rawsc4 `occ2'
capture scalar drop c2
scalar c2=r(rho)
capture drop dimused
gen dimused=1
replace dimused=2 if ( (c2*c2)>(c1*c1) )
tab dimused
gen rawsc=rawsc3
replace rawsc=rawsc4 if (dimused==2)
gen rawsc_2=rawsc4
replace rawsc_2=rawsc3 if (dimused==2)
capture drop dimused
sav $path9\`score'_col.dta, replace
capture drop rawsc3
capture drop rawsc4 
* Tool to convert dimension score sign if necessary
correlate rawsc `occ2'
capture scalar drop corel
scalar corel=r(rho)
gen corelv=corel
gen sign3=1
replace sign3=-1 if corelv > 0
rename rawsc temp
gen rawsc=sign3*temp
drop temp
correlate rawsc_2 `occ2'
capture scalar drop corel2
scalar corel2=r(rho)
gen corelv2=corel2
gen sign4=1
replace sign4=-1 if corelv2 > 0
rename rawsc_2 temp
gen rawsc_2=sign4*temp
drop temp
summarize occ2s rawsc rawsc_2   
keep occ2s rawsc rawsc_2  
sort occ2s
collapse (mean) rawsc rawsc_2 , by(occ2s) 
summarize
sort occ2s
sav $path9\scores_b.dta, replace



* Retrieve the full microdata including non-psds for purposes of scaling, and match
*   scale scores against it: 
* Standardise scaled score to population level mean 50, sd 15:

use $path9\temp4.dta, clear
summarize `occ1' occ1s `occ2' occ2s cfreq 
keep `occ1' occ1s  cfreq 
gen par=0
sort occ1s
merge occ1s using $path9\scores_a.dta
tab _merge
keep if _merge==1 | _merge==3
drop _merge
sav $path9\bit1.dta, replace

use $path9\temp4.dta, clear
summarize `occ1' occ1s `occ2' occ2s cfreq 
keep `occ2' occ2s  cfreq
gen par=1
sort occ2s
merge occ2s using $path9\scores_b.dta
tab _merge
keep if _merge==1 | _merge==3
drop _merge
rename occ2s occ1s
rename `occ2' `occ1'
sav $path9\bit2.dta, replace

use $path9\bit1.dta, clear
append using $path9\bit2.dta
tab par
summarize
sort occ1s
* At this stage, the two derivd variables are rawsc and rawsc2 for both men and women; for men 
* when par=0 and for women when par=1



* Calculate a zscore for the raw scores for the male and female total populations: 



summarize rawsc [fw=cfreq] if par==0 
gen zm1= (rawsc - r(mean)) / r(sd) 
summarize rawsc_2 [fw=cfreq] if par==0
gen zm2= (rawsc_2 - r(mean)) / r(sd)
summarize zm1 zm2 [fw=cfreq] if par==0 

summarize rawsc [fw=cfreq] if par==1 
gen zf1= (rawsc - r(mean)) / r(sd) 
summarize rawsc_2 [fw=cfreq] if par==1
gen zf2= (rawsc_2 - r(mean)) / r(sd)
summarize zf1 zf2 [fw=cfreq] if par==1 




* Re-scale the zscore to the CAMSIS standard range 
gen `score'm = (zm1*(15)) + 50 
replace `score'm = 99 if `score'm >= 99
replace `score'm = 1 if `score'm <= 1
gen `score'm2 = (zm2*(15)) + 50 
replace `score'm2 = 99 if `score'm2 >= 99
replace `score'm2 = 1 if `score'm2 <= 1

gen `score'f = (zf1*(15)) + 50 
replace `score'f = 99 if `score'f >= 99
replace `score'f = 1 if `score'f <= 1
gen `score'f2 = (zf2*(15)) + 50 
replace `score'f2 = 99 if `score'f2 >= 99
replace `score'f2 = 1 if `score'f2 <= 1

sav $path9\temp12.dta, replace

* Split this into a file for rows (typically males) and another for columns (females): 
use $path9\temp12.dta, clear
keep if par==0
sav $path9\temp12m.dta, replace

use $path9\temp12.dta, clear
keep if par==1
sav $path9\temp12f.dta, replace


** Attribute scores plus proportions in occupations: 
use $path9\temp12m.dta, clear
sort occ1s
egen used_m=sum(cfreq), by(occ1s)
sort `occ1'
collapse (mean) `score'm (mean) `score'm2 (mean) used [fw=cfreq] , by(`occ1') 
list `occ1' `score'm `score'm2
sort `occ1'
sav $path9\bit1.dta, replace

use $path9\temp12f.dta, clear
sort occ1s
egen used_f=sum(cfreq), by(occ1s)
sort `occ1'
collapse (mean) `score'f (mean) `score'f2 (mean) used [fw=cfreq] , by(`occ1') 
list `occ1' `score'f `score'f2
sort `occ1'
sav $path9\bit2.dta, replace

use $path9\bit1.dta, clear
summarize
sort `occ1'
merge `occ1' using $path9\bit2.dta
tab _merge
* (keep all merge permutations) 
drop _merge


rename `occ1' tempnname /* two stage rename in case occ1 and occtype are the same */
rename tempnname $occtype
sort $occtype
label variable `score'm "Male CAMSIS for $occtype (`score'm)"
label variable `score'm2 "Dim 2 CAMSIS for $occtype (`score'm2)"
label variable `score'f "Female CAMSIS for $occtype (`score'f)"
label variable `score'f2 "Dim 2 female CAMSIS for $occtype (`score'f2)"

sav $path9\`score'_details.dta, replace
keep $occtype `score'm `score'm2 `score'f `score'f2
codebook, compact
summarize
gen orig=1
sort $occtype 
sav $path9\s1.dta, replace

*** These are the scores according to occupations represented in the data on their recoded forms 



***************************************



**  Next, distribute scores to all known iscos, with the 'orig' indicator to show if isco was represented in 
*     version-specific derivation
* (The commands below will calculate group isco scores as weighted means, then link these with template data if necessary) 


* Get data on the original representation of occs in the samples, merged with the derived scores

use $path9\s1.dta, clear
keep $occtype  `score'm `score'm2
rename $occtype occ1s
sav $path9\m1.dta, replace
use `occ1' occ1s cfreq using $path9\temp4.dta, clear
sort `occ1' 
sav $path9\bit7a.dta, replace
use `occ1' using $pathc, clear
sort `occ1'
merge `occ1' using $path9\bit7a.dta
drop _merge
sort occ1s 
merge occ1s using $path9\m1.dta
drop _merge
summarize
rename `occ1' tempnname
rename tempnname $occtype /* In two stages in case occ1 and occtype are the same */
gen one=1
collapse (sum) mfreq=one (mean) `score'm `score'm2 , by($occtype)
sort $occtype
sav $path9\f1.dta, replace
* -> a dataset of all male occs from original analysis, with numbers per occ 

use $path9\s1.dta, clear
keep $occtype  `score'f `score'f2
rename $occtype occ2s
sav $path9\m1.dta, replace
use `occ2' occ2s cfreq using $path9\temp4.dta, clear
sort `occ2' 
sav $path9\bit7a.dta, replace
use `occ2' using $pathc, clear
sort `occ2'
merge `occ2' using $path9\bit7a.dta
drop _merge
sort occ2s 
merge occ2s using $path9\m1.dta
drop _merge
summarize
rename `occ2' $occtype
gen one=1
collapse (sum) ffreq=one (mean) `score'f `score'f2, by($occtype)
sort $occtype
sav $path9\f2.dta, replace
* -> a dataset of all female occs from original analysis, with numbers per occ 

********************************************


********************************************

* Link these with the unit group template file
insheet using $pathe, clear
capture rename isco88 $occtype  
sort $occtype
sav $path9\f3.dta, replace

* Merge the template file  with the scores files: 
use $path9\f3.dta, clear
sort $occtype 
merge $occtype using $path9\f1.dta, _merge(orig_m)
sort $occtype
merge $occtype using $path9\f2.dta, _merge(orig_f)
summarize

* Find subgroup averages for row and columns based on the scores files only: 
capture drop occ3r
gen occ3r=floor($occtype/10)
capture drop occ2r
gen occ2r=floor($occtype/100)
capture drop occ1r
gen occ1r=floor($occtype/1000)

** In the special case of isco, it is also possible to have 1, 2 and 3 digit occs (e.g. 1, 13)
** which are equivalent to the 4-digit major group summaries derived above (e.g. 1000, 1300) 
replace occ3r=$occtype if $occtype >= 100 & $occtype <= 999 
replace occ3r=($occtype)*10 if $occtype >= 10 & $occtype <= 99 
replace occ3r=($occtype)*100 if $occtype >= 1 & $occtype <= 9
replace occ2r=floor($occtype/10) if $occtype >= 100 & $occtype <= 999
replace occ2r=$occtype if $occtype >= 10 & $occtype <= 99 
replace occ2r=($occtype)*10 if $occtype >= 1 & $occtype <= 9 
replace occ1r=floor($occtype/100) if $occtype >= 100 & $occtype <= 999
replace occ1r=floor($occtype/10) if $occtype >= 10 & $occtype <= 99 
replace occ1r=$occtype if $occtype >= 1 & $occtype <= 9 

sav $path9\f5.dta, replace

use $path9\f5.dta, clear
collapse (mean) occ3rm=`score'm [fw=mfreq], by(occ3r)
summarize
sort occ3r 
sav $path9\f5ma.dta, replace
use $path9\f5.dta, clear
collapse (mean) occ2rm=`score'm [fw=mfreq], by(occ2r)
summarize
sort occ2r 
sav $path9\f5mb.dta, replace
use $path9\f5.dta, clear
collapse (mean) occ1rm=`score'm [fw=mfreq], by(occ1r)
summarize
sort occ1r 
sav $path9\f5mc.dta, replace


use $path9\f5.dta, clear
collapse (mean) occ3rf=`score'f [fw=ffreq], by(occ3r)
summarize
sort occ3r 
sav $path9\f5fa.dta, replace
use $path9\f5.dta, clear
collapse (mean) occ2rf=`score'f [fw=ffreq], by(occ2r)
summarize
sort occ2r 
sav $path9\f5fb.dta, replace
use $path9\f5.dta, clear
collapse (mean) occ1rf=`score'f [fw=ffreq], by(occ1r)
summarize
sort occ1r 
sav $path9\f5fc.dta, replace


use $path9\f5.dta, clear
sort occ3r 
merge occ3r using $path9\f5ma.dta
drop _merge
sort occ3r 
merge occ3r using $path9\f5fa.dta
drop _merge
sort occ2r 
merge occ2r using $path9\f5mb.dta
drop _merge
sort occ2r 
merge occ2r using $path9\f5fb.dta
drop _merge
sort occ1r
merge occ1r using $path9\f5mc.dta
drop _merge
sort occ1r 
merge occ1r using $path9\f5fc.dta
drop _merge

* We've now matched on subgroup averages conditional on employment status; there should be sufficient
*   to distribute to all groups 




summarize `score'm `score'm2  `score'f `score'f2  orig_m orig_f 
replace orig_m=0 if missing(orig_m)
replace orig_f=0 if missing(orig_f)
label variable orig_m "Occupation was represented in the original dataset (by males) "
label variable orig_f "Occupation was represented in the original dataset (by females) "

replace `score'm=occ3rm if missing(`score'm) 
replace `score'f=occ3rf if missing(`score'f) 
summarize `score'm `score'm2  `score'f `score'f2  orig_m orig_f 
replace `score'm=occ2rm if missing(`score'm) 
replace `score'f=occ2rf if missing(`score'f) 
summarize `score'm `score'm2  `score'f `score'f2  orig_m orig_f 
replace `score'm=occ1rm if missing(`score'm) 
replace `score'f=occ1rf if missing(`score'f) 
summarize `score'm `score'm2  `score'f `score'f2  orig_m orig_f 





*****************************************************************
*****************************************************************
*****************************************************************



keep $occtype `score'm `score'f `score'm2 `score'f2 orig_m orig_f
order $occtype `score'm `score'f `score'm2 `score'f2 orig_m orig_f

label variable $occtype "Occupational unit - $occtype"
label variable `score'm "Male CAMSIS score for $occtype" 
label variable `score'f "Female CAMSIS score for $occtype" 

label variable `score'm2 "Dim 2 scale for males for $occtype" 
label variable `score'f2 "Dim 2 scale for females for $occtype" 



drop if missing($occtype)
codebook, compact
summarize
sav $pathd_details, replace


keep $occtype `score'm `score'f
codebook, compact
summarize
sav $pathd, replace


end 


***********************************************************




***********************************************************
***********************************************************

** EOF