Developing a Method to Derive Indicative Health Literacy from Routine Socio-Demographic Data

Context: Low health literacy (HL) is a public health issue, with impacts on population health and illness; however, there are few tools for collecting health literacy data in large populations. Objective: To develop a method of deriving indicative functional HL levels from routinely collected socio-demographic data. Method: We investigated which socio-demographic variables would best depict whether an individual is above or below a constructed HL competency threshold. Weighted logistic regression was used to estimate odds ratios for being below the threshold. Weighted Receiver Operating Characteristic (ROC) analysis examined which variables best predicted low HL. Specificity, sensitivity and area under the ROC curve (AUROC) were descriptors for ability to predict risk. Results: Three models were developed: one using all nine variables; a pragmatic model using the four most predictive variables (Qualification (whether the individual had achieved the level expected by age 16 years), Ethnicity, Home ownership, and Area Deprivation); and one using only “Qualification” (the single most predictive variable). All models showed good prediction of low HL (AUROC 0.73 (95% CI 0.71; 0.74) to 0.78 (95% CI 0.76; 0.79)), with predictive power increasing with more complex models. Conclusion: The most important predictor of low HL is achievement of the qualification level expected by age 16 years, with additional variables adding more predictive power. The developed formulae can be used to estimate functional HL levels in populations from routinely collected socio-demographic data, and hence facilitate effective development and targeting of public health communications. The method to derive the formulae will be applicable in other industrialized countries.


Introduction
The relationship between poor education, health literacy (HL) skills and health is well recognized [1][2][3]. Low health literacy is associated with greater use of medical services, lower use of preventative care, greater difficulty managing long term illnesses [1], lower levels of health [1][2][3] and higher mortality in older people [1,2]. Further, it has been shown that public health messages fail to impact on those with low education, who are not only more likely to have fewer health-promoting behaviours, but are also less likely to respond to public health campaigns than their peers with higher education [4]. Public health campaigns may therefore have the unexpected, and unwanted, effect of widening health inequalities [4]. Patient literacy and health literacy skills are thus of concern to those involved in communicating with patients and the public through public health campaigns to promote health and to reduce the risk of long-term and infectious disease, and in clinical settings to prevent and manage illness.
Low health literacy, "the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand, and use information in ways which promote and maintain good health" [5], is a social determinant of health.
It is an independent predictor for poor health and mortality, augmenting the adverse effects of other social determinants of health such as membership of a minority ethnic group, poverty, limited education, and social deprivation [1][2][3][6].
Increasing complexity of health care systems places greater cognitive demands on patients than ever before. Understanding health issues and obtaining necessary acute and preventive care require making sense of often complex information and navigating a system that demands a high level of understanding [7]. Materials provided to help people achieve and maintain health and to manage illness are too complex for the literacy and numeracy skills of most people who need them [8].
Understanding the problems caused by low health literacy requires knowledge of their extent, in particular the number of people affected. Measurement of health literacy at the individual level is one option. There are several validated measures of functional health literacy, capturing a wide range of health literacy skills. An example is the S-TOFHLA, a 12-minute test of literacy and numeracy skills that measures both health literacy and health numeracy skill levels, and enables people to be classified into 'inadequate', 'marginal' and 'adequate' health literacy levels [9]. There are, however, potential issues with direct measurement of HL levels, particularly individuals' possible feelings of stigma and inadequacy [10] and the time taken to complete measurement tests [11]. Another issue is that current measures are designed for individual, rather than community-level, assessments, and provide little information about the level of health literacy within a population unless applied in large population surveys. A model that could use currently available socio-demographic data to predict likely health literacy levels in individuals or populations would thus be useful for targeting clearer and more effective population public health communications.
Multiple reports have found high correlations between health literacy measures and demographic indicators such as age, ethnicity, and years of schooling [1,3]. Imputed measures based on combinations of these indicators have been proposed [12,13]. Miller [13].
The present study built on the work of Hanchate et al. Its objective was to develop and validate a method of deriving indicative functional health literacy levels from routinely collected socio-demographic data in a younger (working-age) population, applicable at national and international levels.

Method

Data
The following data sources were used: The '2011 Skills for Life Survey' [14] provided the socio-demographic variables whilst 'A mismatch between population health literacy and the complexity of health information: an observational study' [8] provided the health literacy and the combined health literacy and numeracy competency thresholds. A third data set, the '2003 Skills for Life Survey' [15] was used to validate the final models. This validation dataset was chosen because it contained exactly the same questions, assessments, recruitment and sampling strategy as the '2011 Skills for Life Survey', but a different population (Table 1).
2011 Skills for Life Survey (SfL2011)
SfL2011 was designed to measure basic skills amongst people aged between 16 and 65 in England. This was achieved by administering computerised assessments in literacy, numeracy and ICT (information and communication technology) to respondents during interviews. In all, 7,230 interviews were conducted between May 2010 and February 2011. 6,050 individuals were assessed for literacy levels, 4,871 were also assessed for numeracy levels, and 2,274 were assessed for ICT levels. The population in SfL2011 consisted of even proportions of men and women (50% each), the majority of whom categorised themselves as White British (80%). English was the first language for 89% of 16-65 year-olds. The population was distributed in roughly equal proportions across ten-year age bands [14].

A mismatch between population health literacy and the complexity of health information: an observational study
Rowlands et al. describe a method of measuring the gap between the complexity of health materials and the skills of the people for whom it is designed through identification of competency thresholds [8]. They identified two thresholds, one for health materials containing just text information, and one for health information containing both text (literacy) and numeracy information.

2003 Skills for Life Survey (SfL2003)
In this survey 8,730 randomly selected adults aged 16-

The present study
Using the socio-demographic data from SfL2011 and the competency thresholds identified by Rowlands et al. [8], we investigated which variables would best depict whether an individual is above or below the competency thresholds.
Individuals in SfL2011 who had undertaken the literacy test, or the literacy and numeracy tests combined, were included in the study. Individuals who did not start or finish the test were excluded, leaving final analytical samples of 5,824 individuals with established literacy levels, of whom 4,773 also had an established numeracy level.

Ethics
All data used in this study are publicly available and fully anonymised, and therefore ethics approval was not required. Confirmation was obtained from the Research Ethics Office at King's College London.

Statistical analysis
The two functional health literacy competency thresholds identified by Rowlands et al. (literacy and numeracy, and literacy (text) only) were the outcome variables for this study. Separate analyses were undertaken for these two outcomes. Descriptive characteristics were calculated for baseline demographics. The nine variables (sex, age, ethnicity, language, qualification level, job status, gross income, home ownership and area deprivation score) found to be statistically significant in Rowlands et al. were included in this study [8,16].
In the SfL2011 and SfL2003 surveys, data were weighted to correct for sampling errors. Weighting was undertaken by comparing the socio-demographic profile of those allocated to receive the tests, and correcting for national profiles of the English working-age population. We used these weightings in our analysis. Weighted logistic regression was used to estimate odds ratios and z-values for being below the competency thresholds. Weighted univariable and multivariable Receiver Operating Characteristics (ROC) analysis combined with Bootstrap estimation was used to examine which variables had the largest area under the ROC curve. Bootstrap estimation was used to correct for the weights to ensure correct standard errors and confidence intervals.
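The weighted ROC step can be illustrated with a small sketch. The study itself used Stata 12; the Python below is not the authors' code. It assumes a simple percentile bootstrap over individuals, which may differ from the bootstrap procedure actually applied to the survey weights, and the function names `weighted_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import random


def weighted_auc(scores, labels, weights):
    """Weighted area under the ROC curve.

    Computed as the weighted probability that a randomly chosen case
    (label 1, below the threshold) has a higher score than a randomly
    chosen non-case (label 0); tied scores count one half.
    """
    num = 0.0
    den = 0.0
    for s_pos, y_pos, w_pos in zip(scores, labels, weights):
        if y_pos != 1:
            continue
        for s_neg, y_neg, w_neg in zip(scores, labels, weights):
            if y_neg != 0:
                continue
            pair_weight = w_pos * w_neg
            den += pair_weight
            if s_pos > s_neg:
                num += pair_weight
            elif s_pos == s_neg:
                num += 0.5 * pair_weight
    if den == 0:
        raise ValueError("need at least one case and one non-case with positive weight")
    return num / den


def bootstrap_auc_ci(scores, labels, weights, n_boot=500, seed=0):
    """Percentile 95% interval for the weighted AUC (illustrative assumption)."""
    if 0 not in labels or 1 not in labels:
        raise ValueError("labels must contain both 0 and 1")
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        if 0 not in resampled_labels or 1 not in resampled_labels:
            continue  # skip resamples that lack one of the two classes
        estimates.append(
            weighted_auc(
                [scores[i] for i in idx],
                resampled_labels,
                [weights[i] for i in idx],
            )
        )
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper
```

The pairwise loop is quadratic in the sample size, so for survey-sized data a rank-based computation would be preferable; the pairwise form is kept here because it mirrors the definition directly.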
Three models were developed for each outcome: one using all nine variables, a more pragmatic 4-factor model using only those variables in the logistic regression with high z-values and which are commonly collected in public surveys, and one using only the variable with the highest z-value and the largest area under the ROC curve. Models with interactions among the various variables were not explored. Specificity, sensitivity, likelihood ratios and area under the ROC were used as descriptors of each model's ability to predict an individual's risk of being below the competency threshold.
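The descriptors named above can be computed from a weighted two-by-two table. The following is a minimal sketch rather than the authors' Stata code; the probability cutoff of 0.5 and the function name `diagnostic_summary` are assumptions made for illustration, since the cutoff used in the study is not stated here.

```python
def diagnostic_summary(probs, labels, weights, cutoff=0.5):
    """Weighted sensitivity, specificity and likelihood ratios.

    probs   : predicted probabilities of being below the threshold
    labels  : 1 if the individual is truly below the threshold, else 0
    weights : survey weights
    cutoff  : probability at or above which a person is classified
              as below the threshold (illustrative assumption)
    """
    tp = fp = tn = fn = 0.0
    for p, y, w in zip(probs, labels, weights):
        predicted_low = p >= cutoff
        if y == 1 and predicted_low:
            tp += w
        elif y == 1:
            fn += w
        elif predicted_low:
            fp += w
        else:
            tn += w
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both cases and non-cases with positive weight")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios are undefined when the denominator is zero.
    lr_positive = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_negative = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }
```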

Validation
Two validation methods were used: a within-data (internal) validation and a between-data (external) validation. Initial within-data validation ensured our calibration was computed correctly and was applied to check that the prediction scores fitted the dataset. Subsequently, the models were validated against the SfL2003 dataset in order to get an unbiased assessment of how the models might perform in practice, as estimates from original datasets are typically over-optimistic [17].
To maximize methodological fidelity, the SfL2003 dataset was not downloaded until after the initial analysis generating models from the SfL2011 data. Variables were recoded as necessary to ensure they were equivalently specified in both the SfL2011 and the SfL2003 dataset. The validation was done using the same statistics as for the SfL2011 dataset. This was repeated for both literacy and numeracy tested and literacy-only tested individuals.
All the statistical analyses were performed using Stata Version 12.
As this was an observational study, STROBE guidelines [18] were followed.

Results
Demographic characteristics for the combined literacy and numeracy-, and the literacy-tested, individuals are described in Table 2. Of the literacy and numeracy-tested individuals, 2,922/4,773 (61.2%) were below the competency threshold; of the literacy-tested individuals, 2,508/5,824 (43.1%) were below the competency threshold.
As most health materials contain both literacy and numeracy information [8], the results relating to the combined literacy and numeracy threshold are reported as the main findings, whilst the results of the literacy only threshold are reported in supplementary tables and figures.
The variable "qualification level" was found to be the single most predictive variable according to the z-value and the univariable ROC analysis. The additional three variables included in the pragmatic 4-factor model were ethnicity, home ownership, and socio-economic deprivation level of residential areas as measured by the Index of Multiple Deprivation (IMD) [19].
The 9-factor and the pragmatic 4-factor models were thus both significantly better than the one-factor model, while the 9-factor model was not significantly better than the 4-factor model. Developed from 4,773 individuals with complete data, the formulae for each model predict an individual's log odds of being below the health literacy and numeracy threshold as a function of that person's characteristics (Table 3). The formula for being below the threshold is thus logit(p_i) = f(x), where p_i is the probability that individual i is below the threshold and x denotes that individual's characteristics. Writing L for the predicted log odds, the predicted percentage probability of any participant in a study being below the health literacy and numeracy threshold can be calculated as e^L / (1 + e^L) × 100%.

Table 3. Formulae for each of the three models. Note 1: All the numbers are stated as estimates with 95% confidence intervals. Note 2: The four best health literacy and numeracy predictors: qualifications, ethnicity, home ownership, and area deprivation level (IMD level).

Table 4. Prediction of low health literacy and numeracy.
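The conversion from predicted log odds to a percentage probability can be sketched as follows. The coefficients in `ILLUSTRATIVE_MODEL` are invented purely to show the arithmetic and are not the values reported in Table 3; the variable names and both function names are likewise hypothetical.

```python
import math

# Hypothetical coefficients for illustration only; NOT the Table 3 estimates.
ILLUSTRATIVE_MODEL = {
    "intercept": -1.0,
    "no_qualification_at_16": 1.5,
    "minority_ethnic_group": 0.5,
    "not_home_owner": 0.4,
    "high_area_deprivation": 0.3,
}


def log_odds(person, model):
    """Linear predictor L: intercept plus the sum of coefficient * indicator."""
    return model["intercept"] + sum(
        coef * person.get(name, 0)
        for name, coef in model.items()
        if name != "intercept"
    )


def probability_below_threshold(l_value):
    """Logistic transform e^L / (1 + e^L), written to avoid overflow."""
    if l_value >= 0:
        return 1.0 / (1.0 + math.exp(-l_value))
    exp_l = math.exp(l_value)
    return exp_l / (1.0 + exp_l)


# Usage: a person whose only risk indicator is the qualification variable.
person = {"no_qualification_at_16": 1}
percentage = 100 * probability_below_threshold(log_odds(person, ILLUSTRATIVE_MODEL))
```

With these invented coefficients the log odds for this person are 0.5, which corresponds to a predicted probability of roughly 62%.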

Validation of the models
The internal validation indicated that the prediction scores were calibrated correctly, and that the event rates for all the groups fell into the right categories. There was agreement between the predicted probabilities and the observed data.
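One way to make such a calibration check concrete is to group individuals by predicted probability and compare the mean prediction with the observed weighted event rate in each group. The sketch below assumes equal-width probability bands, which is an illustrative choice and not necessarily the grouping used in the study; `calibration_table` is a hypothetical name.

```python
def calibration_table(probs, labels, weights, n_bins=5):
    """Compare mean predicted probability with observed event rate per band.

    Returns a list of tuples:
    (band_lower, band_upper, mean_predicted, observed_rate)
    for every band that contains at least one individual.
    """
    bins = [{"w": 0.0, "pred": 0.0, "obs": 0.0} for _ in range(n_bins)]
    for p, y, w in zip(probs, labels, weights):
        # A probability of exactly 1.0 is placed in the last band.
        i = min(int(p * n_bins), n_bins - 1)
        bins[i]["w"] += w
        bins[i]["pred"] += w * p
        bins[i]["obs"] += w * y
    rows = []
    for i, b in enumerate(bins):
        if b["w"] == 0:
            continue  # skip empty bands
        rows.append(
            (i / n_bins, (i + 1) / n_bins, b["pred"] / b["w"], b["obs"] / b["w"])
        )
    return rows
```

In a well-calibrated model the third and fourth entries of each row are close to one another.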
The results of the external validation using the SfL2003 dataset can be found in the supplementary material (Figure 1S). The diagnostic properties of each model were tested and can be seen in Tables 2S and 3S.

Summary of findings
To our knowledge, this is the first model to predict health literacy in an English working-age population. Our method builds on that reported by Hanchate et al., which was developed and applied in a population of older people using US socio-demographic data [13].
For both the literacy and numeracy competency threshold and the literacy-only competency threshold, three models of varying complexity were developed to impute functional health literacy and numeracy levels. ROC areas between 0.71 and 0.78 indicated that, overall, the models discriminated well between people below and above the competency thresholds. The pragmatic 4-factor model was significantly better than the one-factor model, while the 9-factor model, whilst not significantly better than the 4-factor model in the SfL2011 dataset, did appear to perform better when tested against the external validation (SfL2003) dataset. Each model and its related formula demonstrated fair to good diagnostic properties. The implications of these results are that education level, by far the strongest predictor of health literacy competency, is essential in predicting health literacy levels; if education level is the only predictor available, it will give a reasonable level of accuracy. If the additional three variables in the pragmatic model are available, the accuracy of the prediction will be significantly improved. The further improvement obtained with the 9-factor model in the external validation dataset suggests that this is the best model to use if data on all the variables are available.

Strengths of the study
A strength of the study is the high quality of the data. The datasets from which the models were developed (SfL2011) and validated (SfL2003) are large, nationally representative samples of the English working-age population, using detailed individual-level socio-demographic data and literacy and numeracy assessments developed by education-testing experts; response bias is thus unlikely.
The availability of combined literacy and numeracy data on a large proportion of the survey sample enables models to be developed on combined literacy and numeracy skills. This is important as literacy and numeracy skills are not highly correlated at the individual level [14,15] and most health materials contain both text (literacy) and numeracy information [8].

Limitations of the study
The models are limited by the explanatory power of the 9, 4 or 1 predictor variable(s) considered, both for combined health literacy and numeracy and for health literacy alone. Although there are other cultural, societal, educational and health system factors that may improve the prediction of a person being below the threshold, this research was limited to explanatory variables commonly collected in public data sets.
In the Skills for Life surveys, population skills were measured using tests of a type that, whilst widely used in national and international surveys, have been criticised for only partially measuring skills, not adequately reflecting different cultures, and not adequately reflecting 'real life' [20]. However, the skills tests used have been extensively tested and validated, and provide the best available estimates of population literacy, numeracy, and health literacy skills in England.

Comparison with the existing literature
The DAHL is an imputed measure for community-living elderly aged 65 or older in USA [13]. Hanchate

Implications for research and practice
The prediction formulae developed in this study enable a reasonably accurate prediction of health literacy competency from routinely collected socio-demographic data in the English working-age population. This enables researchers working on English datasets, where some or all of the variables in our model are collected, to derive indicative health literacy levels for population datasets, provided that the datasets contain data on educational level. Application of the formulae described in this paper to such datasets will enable researchers to explore the relationships between health literacy and health, education, and other social determinants of health. Such studies could include investigations of the health economic implications of health literacy, identified as an important area for research and development [1].
The models described in this paper could also be used to aid health service planning, particularly in developing clearer and more effective communication with patients and the public. Application of the models at borough area-level (150,000-350,000 people) or at national level could aid in identification of areas where services should be tailored for people with low health literacy skills, with redistribution of resources to enable health authorities in areas with high numbers of people facing health literacy challenges to develop more effective services for their patients. Public health campaigns in these areas would require better tailoring of health promotion and disease prevention campaigns to improve impact, [4] with public health and health education practitioners and organisations trained to improve communication skills.
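As an illustration of area-level application, individual predicted probabilities can be aggregated into an expected proportion of people below the threshold in each area. This is a sketch of one possible aggregation, a weighted mean of predicted probabilities, and is an assumption rather than a procedure described in the study; `area_level_estimates` and the area codes are hypothetical.

```python
from collections import defaultdict


def area_level_estimates(records):
    """Expected proportion below the threshold in each area.

    records: iterable of (area_code, predicted_probability, weight) tuples.
    Returns {area_code: weighted mean predicted probability}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # area -> [sum of w * p, sum of w]
    for area, prob, weight in records:
        totals[area][0] += weight * prob
        totals[area][1] += weight
    return {
        area: weighted_sum / weight_sum
        for area, (weighted_sum, weight_sum) in totals.items()
        if weight_sum > 0
    }


# Usage with invented values for two hypothetical areas.
example = [("A", 0.8, 1.0), ("A", 0.4, 1.0), ("B", 0.2, 2.0)]
estimates = area_level_estimates(example)
```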
The similarity of our findings to those of Hanchate et al. [13], undertaken in the US, indicates that the method described here is likely to be applicable in most industrialised countries. Whilst the exact data collected, and the categorisation of those data, will vary between countries, it would appear that education level, age, whether the national language is an individual's first language, and area socio-economic deprivation will be important and valid in national models.

Unanswered questions and future research
Future research should further explore ways to effectively improve communication with patients and the public, particularly those with lower health literacy skills, and evaluate the impact on patient satisfaction, patient safety, patient health, and the impact of public health campaigns. This study only addresses functional health literacy skills; studies that explore other health literacy skills, e.g., verbal communication skills, interactive and critical health literacy skills [11] would be very valuable.

Conclusion
This research has developed and validated a method for predicting population health literacy (and numeracy) levels from routinely collected socio-demographic data. The prediction models and formulae described in this paper can be used to further investigate health literacy, health and illness, and to manage health services so that they communicate better with people with low health literacy.