The inclusion of machine-learning-derived models in systematic reviews of risk prediction models for colorectal cancer is rare. Whilst such reviews have highlighted methodological issues and limited performance of the models included, it is unclear why machine-learning-derived models are absent and whether such models suffer similar methodological problems. This scoping review aims to identify machine-learning models, assess their methodology, and compare their performance with that found in previous reviews. A literature search of four databases was performed for colorectal cancer prediction and prognosis model publications that included at least one machine-learning model. A total of 14 publications were identified for inclusion in the scoping review. Data were extracted using an adapted CHARM checklist against which the models were benchmarked. The review found methodological problems with machine-learning models similar to those observed in systematic reviews for non-machine-learn...
International Journal of Environmental Research and Public Health, 2021
Background: The growth and maturation of infants reflect their overall health and nutritional status. The purpose of this study is to examine the associations of prenatal and early postnatal factors with infant growth (IG). Methods: A data-driven model was constructed by structural equation modelling to examine the relationships between pre- and early postnatal environmental factors and IG at age 12 months. The IG was a latent variable created from infant weight and waist circumference. Data were obtained on 274 mother–child pairs during pregnancy and the postnatal periods. Results: Maternal pre-pregnancy BMI emerged as an important predictor of IG with both direct and indirect (mediated through infant birth weight) effects. Infants who gained more weight from birth to 6 months and consumed starchy foods daily at age 12 months were more likely to be larger by age 12 months. Infant physical activity (PA) levels also emerged as a determinant. The constructed model provided a reasonab...
Introduction: The emergence of the novel respiratory SARS-CoV-2 and subsequent COVID-19 pandemic have required rapid assimilation of population-level data to understand and control the spread of infection in the general and vulnerable populations. Rapid analyses are needed to inform policy development and target interventions to at-risk groups to prevent serious health outcomes. We aim to provide an accessible research platform to determine demographic, socioeconomic and clinical risk factors for infection, morbidity and mortality of COVID-19, to measure the impact of COVID-19 on healthcare utilisation and long-term health, and to enable the evaluation of natural experiments of policy interventions. Methods and analysis: Two privacy-protecting population-level cohorts have been created and derived from multisourced demographic and healthcare data. The C20 cohort consists of 3.2 million people in Wales on 1 January 2020 with follow-up until 31 May 2020. The complete cohort dataset wil...
Background: Physical activity (PA) levels are associated with long‐term health, and levels of PA when young are predictive of adult activity levels. Objectives: This study examines factors associated with PA levels in 12‐month‐old infants. Method: One hundred forty‐one mother‐infant pairs were recruited via a longitudinal birth cohort study (April 2010 to March 2013). The PA level was collected using accelerometers and linked to postnatal notes and electronic medical records via the Secure Anonymised Information Linkage databank. Univariable and multivariable linear regressions were used to examine the factors associated with PA levels. Results: Using univariable analysis, higher PA was associated with the following (P value less than 0.05): being male, larger infant size, healthy maternal blood pressure levels, full‐term gestation period, higher consumption of vegetables (infant), lower consumption of juice (infant), low consumption of adult crisps (infant), longer breastfeeding duration, ...
International Journal for Population Data Science, 2017
Objectives: 1) To develop a fully data-driven framework for automatically identifying patients with a condition from routine electronic primary care records; 2) to identify informative codes (risk factors) of arthropathy conditions in primary care records that can accurately predict a diagnosis of the conditions in secondary care records. Approach: This study linked routine primary and secondary care records in Wales, UK, held in the SAIL (Secure Anonymised Information Linkage) databank, in which the secondary care records were used as the gold standard. We proposed to use machine learning techniques to extract patient information and identify cohorts with a condition from the large and high-dimensional linked dataset using the following phases: data preparation, performed in a machine learning context; pre-selection of initial features, ranking and selecting features into a meaningful subset by using feature selection methods; and identification algorithm deve...
BMC Medical Informatics and Decision Making, Jan 5, 2017
Patients' smoking status is routinely collected by General Practitioners (GPs) in UK primary health care. There is an abundance of Read codes pertaining to smoking, including those relating to smoking cessation therapy, prescription, and administration codes, in addition to the more regularly employed smoking status codes. Large databases of primary care data are increasingly used for epidemiological analysis; smoking status is an important covariate in many such analyses. However, the variable definition is rarely documented in the literature. The Secure Anonymised Information Linkage (SAIL) databank is a repository for a national collection of person-based anonymised health and socio-economic administrative data in Wales, UK. An exploration of GP smoking status data from the SAIL databank was carried out to explore the range of codes available and how they could be used in the identification of different categories of smokers, ex-smokers and never smokers. An algorithm was deve...
To estimate the direct healthcare cost of infants born to overweight or obese mothers to the National Health Service in the UK. Retrospective prevalence-based study. Combined linked anonymised electronic data sets on a cohort of mother-child pairs enrolled on the Growing Up in Wales: Environments for Healthy Living (EHL) study. Infants were categorised according to maternal early-pregnancy body mass index (BMI): healthy weight mother (18.5≤BMI<25 kg/m²; n=342), overweight mother (25≤BMI≤29.9 kg/m²; n=157) and obese mother (BMI≥30; n=110). 609 singleton pregnancies with available health service records and an antenatal maternal BMI. Total health service utilisation and direct healthcare costs for providing these services in the year 2012-2013. Costs are calculated as cost of the infant (no maternal costs considered) and are related to health service usage from birth to age 1 year. A strong association existed between healthcare usage cost and BMI (p<0.001). Mean total costs...
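The three maternal BMI groups stated in this abstract can be written down as a small function. This is only an illustrative sketch: the function name `maternal_bmi_group` is hypothetical, the thresholds are copied from the abstract as printed, and returning `None` for values outside the stated bands (below 18.5, or between 29.9 and 30) is an assumption, since the abstract does not say how such values were handled.

```python
from typing import Optional


def maternal_bmi_group(bmi: float) -> Optional[str]:
    """Map an early-pregnancy BMI (kg/m^2) to the group named in the abstract.

    Hypothetical helper: bands are taken verbatim from the abstract; anything
    outside them returns None because the abstract does not define a group.
    """
    if 18.5 <= bmi < 25:
        return "healthy weight"
    if 25 <= bmi <= 29.9:
        return "overweight"
    if bmi >= 30:
        return "obese"
    return None  # below 18.5, or in the unstated gap between 29.9 and 30


if __name__ == "__main__":
    for value in (22.0, 27.3, 31.5, 17.0):
        print(value, maternal_bmi_group(value))
```

Keeping the bands in one function makes the grouping rule explicit and easy to check against the reported group sizes (n=342, n=157, n=110).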
To classify wear and non-wear time of accelerometer data for accurately quantifying physical activity in public health or population level research. A bi-moving-window-based approach was used to combine acceleration and skin temperature data to identify wear and non-wear time events in triaxial accelerometer data that monitor physical activity. Local residents in Swansea, Wales, UK. 50 participants aged under 16 years (n=23) and over 17 years (n=27) were recruited in two phases: phase 1: design of the wear/non-wear algorithm (n=20) and phase 2: validation of the algorithm (n=30). Participants wore a triaxial accelerometer (GeneActiv) against the skin surface on the wrist (adults) or ankle (children). Participants kept a diary to record the timings of wear and non-wear and were asked to ensure that events of wear/non-wear lasted for a minimum of 15 min. The overall sensitivity of the proposed method was 0.94 (95% CI 0.90 to 0.98) and specificity 0.91 (95% CI 0.88 to 0.94). It performed...
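For readers unfamiliar with the two validation metrics quoted above, the sketch below shows how sensitivity and specificity are computed when a classifier's output is compared with a reference record. It is not the study's algorithm: the function name, the toy labels, and the choice of "wear" as the positive class are all assumptions made for illustration, with the participant diary assumed to serve as the reference.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity), treating True ("wear") as positive.

    Hypothetical helper: `reference` stands in for diary labels and
    `predicted` for classifier output, one entry per time window.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # wear, classified as wear
    fn = sum(1 for r, p in pairs if r and not p)      # wear, classified as non-wear
    tn = sum(1 for r, p in pairs if not r and not p)  # non-wear, classified as non-wear
    fp = sum(1 for r, p in pairs if not r and p)      # non-wear, classified as wear
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in the reference labels")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    diary = [True, True, True, True, False, False, False, False]
    algorithm = [True, True, True, False, False, False, False, True]
    print(sensitivity_specificity(diary, algorithm))  # (0.75, 0.75)
```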
To estimate the direct healthcare cost of being overweight or obese throughout pregnancy to the National Health Service in Wales. Retrospective prevalence-based study. Combined linked anonymised electronic datasets gathered on a cohort of women enrolled on the Growing Up in Wales: Environments for Healthy Living (EHL) study. Women were categorised into two groups: normal body mass index (BMI; n=260) and overweight/obese (BMI>25; n=224). 484 singleton pregnancies with available health service records and an antenatal BMI. Total health service utilisation (comprising all general practitioner visits and prescribed medications, inpatient admissions and outpatient visits) and direct healthcare costs for providing these services in the year 2011-2012. Costs are calculated as cost of mother (no infant costs are included) and are related to health service usage throughout pregnancy and 2 months following delivery. There was a strong association between healthcare usage cost and BMI (p<...