Currently, most surveys ask for occupation with open-ended questions. The verbatim responses are coded afterwards into a classification with hundreds of categories and thousands of jobs, which is an error-prone, time-consuming, and costly task. Research on the coding of occupations is summarized in an international literature review, with special attention paid to our main topic, the automation of coding.
A prominent approach for automated coding is to consult a dictionary for the correct code. In contrast, we focus on data-based methods, where codes for new answers are predicted from answers that have already been coded. Four coding methods are tested on two data sets: (1) rule-based coding, which consults a dictionary, (2) data-based Naive Bayes, which allows coding of text answers with multiple words, (3) a data-based Bayesian Categorical model, which improves performance when relatively few answers have been coded before, and (4) combined methods (boosting), which combine predictions from the first three methods.
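The data-based Naive Bayes idea from method (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the verbatim answers, occupation codes, and Laplace smoothing choice below are invented for demonstration purposes.

```python
from collections import Counter, defaultdict
import math

# Toy training data: (verbatim answer, occupation code).
# Texts and codes are hypothetical, not from the paper's data sets.
train = [
    ("primary school teacher", "2341"),
    ("teacher at a school", "2341"),
    ("software developer", "2512"),
    ("web developer software", "2512"),
    ("truck driver", "8332"),
    ("long distance truck driver", "8332"),
]

def fit(data):
    """Estimate log priors and Laplace-smoothed word log likelihoods."""
    class_counts = Counter(code for _, code in data)
    word_counts = defaultdict(Counter)  # code -> word frequencies
    vocab = set()
    for text, code in data:
        for word in text.split():
            word_counts[code][word] += 1
            vocab.add(word)
    n, v = len(data), len(vocab)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so unseen words get nonzero mass.
        log_like[c] = {w: math.log((counts[w] + 1) / (total + v)) for w in vocab}
    return log_prior, log_like, vocab

def predict(text, log_prior, log_like, vocab):
    """Return occupation codes ranked by posterior score, best first."""
    words = [w for w in text.split() if w in vocab]
    scores = {c: lp + sum(log_like[c][w] for w in words)
              for c, lp in log_prior.items()}
    return sorted(scores, key=scores.get, reverse=True)

model = fit(train)
print(predict("school teacher", *model)[0])  # → 2341
```

Returning a ranked list rather than a single code matches the semi-automated workflow described below, where human coders choose from the top suggestions.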
The proposed Bayesian Categorical model is able to code 38% of all answers at a 3% error rate without human interaction. In all remaining cases, or when higher quality is required, human judgment is needed to decide on the correct code, and computer software can only assist by suggesting possible job codes. With the prototype software we developed for this task, we expect that for 74% of all answers the correct category is provided within the top five code suggestions. The training data used for prediction consists of only 32,882 coded answers, which is small compared to other systems with a similar purpose. The proportions given above are expected to improve with additional training data.