C2W3V3

Hello and welcome to part three of week three of our Health Data Analytics Opportunities and
Applications course. Today I'm going to talk about critically appraising and understanding the
validity of a given dataset, and I'm going to focus on two parts. First, I'll review some of the
key insights from the 3.2 reading that you completed, and then I'll talk in a little more depth
about how we critically appraise datasets.

This is quite a complex topic, so I'm really just going to skim the surface, but I want you to
start thinking about some of the questions you might ask as an analytics lead, as someone
directing an analytics team, or as someone trying to engage in evidence-informed decision
making, in order to understand the validity or relevance of a dataset for the type of decision
making you might be engaging in. To fully engage with this week's lecture you really need to
have completed the assigned reading: the Gianfrancesco and Goldstein article, A Narrative
Review on the Validity of Electronic Health Record (EHR)-based Research in Epidemiology. So
if you haven't had a chance to read, review, or even scan that article yet, I'm going to ask you
to pause here, take some time to read, highlight, and understand the article, then come back,
pick up where you left off, and I'll go into my summary of what I found insightful from that
research.

All right, so assuming now that everyone's had the chance to get up to speed with the
assigned reading, I'm going to jump right into my review. To provide a brief overview,
Gianfrancesco and Goldstein, or G&G as I'll call them for short, present four challenges when
using EHR (electronic health record) data. The first they talk about is representativeness,
which means that the sample collected through the EHR may not be representative of the
population to which the policy or health system change being implemented will apply.

So a good example they provide in the reading is the assumption that the racial profile of the
community that visits a particular hospital is representative of the broader community in
which that hospital is located. There's a whole bunch of reasons why this may not be the
case, not the least of which is that it can be difficult to even assess the extent to which it is
the case: census data may lack information about health factors, for example disability, and
EHR data may lack useful information about the race or ethnicity of the patients seen within
that dataset. G&G also talk about the availability and interpretability of non-clinical and
clinical data, and how well these two things match.

We'll talk a little more about that in a second. They also raise the issues of missing
measurements and missing visits. This quote, I think, really captures what underlies the
article: "by being aware of the limitations that the data present and proactively addressing
them, EHR studies will be more robust, informative, and important to the understanding of
health and disease in the population." Which is to say, the article isn't claiming that EHR data
should not be used in epidemiological studies; rather, a careful understanding of the validity
of that data, and of the impact of these challenges on the EHR data and the conclusions
drawn from it, is critical to using the data in a responsible and meaningful way. And that's
the theme of this section of lectures: data are useful, but all datasets have limitations of
some form, so it's the responsibility of the person generating reports from the data, the
person analyzing it, and the person making decisions based on it to ask what might be
missing, who might have been left out, and what pieces of information were included that
might mislead you about the relevance of that particular sample. So, to dive a little deeper
into some of the threats to validity, I'm going to talk about three different types of bias.

Those are selection bias, information bias, and confounding bias. Confounding is one of my
favorite forms of bias; I think it's a really interesting way of thinking about external factors
which may be related to the variables that you're studying.

I'll talk about that in a second. But let's start with selection bias, which is a systematic
difference between participants and non-participants, or between the treatment and control
groups. An example of this might be a set of study participants chosen on the basis of their
distance from a particular clinic.

So this particular set of patients is chosen out of an electronic health record because they
routinely visit the clinic. But the reason they routinely visit the clinic is that they live in close
proximity to it, let's say within walking distance. Now, the risk of choosing people within
walking distance of the clinic is that you may be selecting, in a problematic way, for
individuals from a particular ethnic or socioeconomic group, as we know that people tend to
live in particular communities on the basis of their history, their economy, their ability, and
so on.

So in this example of selection bias, you may be missing other relevant participants who have
other socioeconomic preconditions, and this is particularly problematic when the variable
you're ultimately selecting on, proximity to the clinic, is associated with the factors you're
trying to study. Let's say you've chosen a clinic in the West Point Grey neighborhood of
Vancouver, sitting on West 10th Avenue near UBC. The risk of choosing that clinic is that
you're essentially choosing a wealthy, predominantly white population to study, because
those are the people who live in close proximity to the clinic, have the easiest access to it,
and will visit it most frequently.

Now the risk is that if you're setting, let's say, a provincial policy direction on the basis of the
epidemiological study you conduct at that clinic, you'll be missing large, important,
potentially very vulnerable portions of the population who do not have the luxury of living in
the West Point Grey neighborhood and are therefore missed by the study. For those of you
not based in Vancouver, you can look it up on a map, but it's a wealthy neighborhood on the
west side of the city.

The reason selection bias is a threat to the validity of the dataset is this: if you have selected
out a certain section of the population and you then use that data to inform conclusions that
will ultimately be extended to a wider portion of the population who weren't included in the
original study, you're missing something critically important, and the validity of the dataset
as a way of understanding the population as a whole is compromised.
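To make this concrete, here's a minimal, purely illustrative simulation of the walking-distance
scenario. It's not from the G&G article, and every number in it is invented; it just shows how
sampling on proximity to a clinic in a wealthy area skews the EHR sample away from the
population a policy would apply to:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: distance to the clinic (km) and household income,
# where people living closer to this (wealthy-area) clinic tend to be wealthier.
n = 100_000
distance_km = rng.exponential(scale=5.0, size=n)
income = 90_000 - 4_000 * distance_km + rng.normal(0, 15_000, size=n)

# Selection rule resembling the lecture example: only patients within
# walking distance (say, 2 km) show up routinely in this clinic's EHR.
in_ehr = distance_km < 2.0

print(f"Population mean income: {income.mean():,.0f}")
print(f"EHR sample mean income: {income[in_ehr].mean():,.0f}")
# The EHR sample is systematically wealthier than the population it is
# then used to make policy for -- that gap is the selection bias.
```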

Similarly, information bias is bias in which key variables are incorrectly measured or
classified. An example here: individuals with dementia may be asked to self-report their risk
factors, things like blood pressure, diet, or smoking, but because of their dementia they may
not recall exposures that actually occurred. This can result in mischaracterized or critically
missing data which biases the information you do have.
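As a sketch of how this plays out, here's a small simulation, with invented numbers, of
differential recall: the true smoking history is the same in both groups by construction, but
the reported history is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical cohort: dementia severity (mild vs severe) and true past
# smoking, which here is independent of severity by construction.
severe = rng.random(n) < 0.5
smoked = rng.random(n) < 0.4

# Recall is worse with severe dementia: severe patients report a true
# smoking history only 30% of the time, mild patients 90% of the time.
recall_prob = np.where(severe, 0.3, 0.9)
reported_smoking = smoked & (rng.random(n) < recall_prob)

print("True smoking rate,     mild vs severe:",
      smoked[~severe].mean().round(3), smoked[severe].mean().round(3))
print("Reported smoking rate, mild vs severe:",
      reported_smoking[~severe].mean().round(3),
      reported_smoking[severe].mean().round(3))
# Reported rates differ by severity even though true rates do not --
# differential misclassification distorting the measured association.
```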

So if, for example, only those with less severe dementia recall having smoked in their younger
days, while those with severe dementia do not recall smoking, this may lead to the incorrect
conclusion that there's no linkage between the severity of dementia and early smoking
behavior. You can see how this could result in misinformed decision-making on the basis of
mischaracterized or incorrectly measured data.

And then the final type, which as I mentioned earlier is one of my favorites: the confounding
variable.

This is when an external variable impacts both the cause and the effect in a study, that is,
both the independent variable and the dependent variable you're interested in, and can
impact the effect (the outcome, the dependent variable) even in the absence of the cause,
but this isn't taken into consideration when designing the study. For example, there's been a
lot of research done on the relationship between caffeine consumption and heart disease or
heart issues.

And this relationship is actually strongly confounded by smoking behavior. There's evidently
a linkage wherein smoking behavior causes increased caffeine consumption, and smoking
behavior also causes heart disease. Even in the absence of caffeine, smoking causes heart
disease or heart issues.

And so you can see how, if you fail to account for smoking behavior, you might misrepresent
or misunderstand the linkage between caffeine and heart disease: it may appear stronger
than it is, because you've excluded this important confounding variable.
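Here's a toy simulation of that caffeine-smoking-heart disease triangle (all rates invented).
Caffeine has no effect on heart disease by construction, yet the crude comparison makes it
look harmful; stratifying on the confounder reveals the truth:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process: smoking raises both the chance of
# heavy caffeine use and the chance of heart disease; caffeine itself has
# NO effect on heart disease in this simulation.
smoker = rng.random(n) < 0.3
caffeine = rng.random(n) < np.where(smoker, 0.7, 0.3)
heart_disease = rng.random(n) < np.where(smoker, 0.20, 0.05)

# Crude (unadjusted) comparison: caffeine looks harmful.
print("Crude HD risk, caffeine vs none:",
      heart_disease[caffeine].mean().round(3),
      heart_disease[~caffeine].mean().round(3))

# Stratify on the confounder: within smokers and within non-smokers,
# the apparent caffeine "effect" disappears.
for s, label in [(True, "smokers"), (False, "non-smokers")]:
    grp = smoker == s
    print(f"{label}: caffeine vs none:",
          heart_disease[grp & caffeine].mean().round(3),
          heart_disease[grp & ~caffeine].mean().round(3))
```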
The table you see here is a replication of the one on page two of the G&G article. It lays out,
as clearly as possible, the linkage between the challenges I introduced at the beginning of
this lecture, some examples of where each has occurred or may be particularly relevant, the
threat to validity (linking back to the last slide I presented), and some potential solutions for
dealing with these issues.

In particular, G&G highlight the relevance of natural language processing in helping to
understand the relationship between qualitative and quantitative data, and in processing
large amounts of qualitative data to assess their relevance to a study. Now, natural language
processing is a branch of artificial intelligence, and there are lots of experts in that area who
could tell you far more about it than I can, but I think it poses an interesting opportunity to
talk more broadly about the fact that if we're going to engage in this type of research, and I
think it's necessary that we do, then we need more robust examinations of the validity of the
different datasets we rely on for decision-making, particularly in the health sector, at all
levels: from the individual clinician all the way up to the decisions made by senior executives
across the whole sector.
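For a flavor of what this looks like in practice, here's a deliberately crude sketch: simple
keyword matching, which is only a stand-in for real clinical NLP (and not an example from the
G&G article). Real systems handle negation, abbreviations, and context far more carefully.

```python
import pandas as pd

# Toy free-text clinical notes (entirely made up).
notes = pd.Series([
    "Pt reports unstable housing, currently staying in a shelter.",
    "No acute issues. Lives with spouse.",
    "Homeless for 6 months; follow-up with social work arranged.",
])

# Flag notes that mention homelessness-related terms.
possible_homelessness = notes.str.contains(
    r"homeless|shelter|unstable housing", case=False, regex=True
)
print(possible_homelessness)
# Even this toy pass turns free text into a structured flag -- the kind
# of variable that is often missing from structured EHR fields.
```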

I think it all has to start from a core set of questions about how valid the data is, and from a
key understanding, which is what I'm trying to convey here and what G&G convey, very likely
more eloquently and less awkwardly than I do, of the relationship between data validity and
the misunderstandings and inappropriate decision-making that result from ignoring it. So, for
the final slide on the G&G article, I want to share what I think is a really powerful example of
how data validity problems can undermine the conclusions drawn from a study. I'm just going
to read this quote out.

"If a given provider inquires about homelessness of a patient based on knowledge of that
patient's situation or other external factors, and then documents this in the clinical note, we
have greater assurance that this is true in the positive case." So, if the provider goes about
doing this, concludes that the patient is indeed homeless, and documents that in their notes,
then we have some assurance that it is a true statement. "However, the lack of mention of
homelessness in a clinical note should not be assumed as a true negative case for several
reasons."

"Not all providers may feel comfortable asking about and/or documenting homelessness, they
may not deem this variable worth noting, or implicit bias among clinicians may affect what is
captured." This, I think, also ties back to what we were talking about earlier with regard to
patients with dementia: the absence of a recollection of a particular behavior is perhaps less
powerful evidence than the memory of one.

So, as a result, "such cases (e.g., no mention of homelessness) may be incorrectly identified
as not homeless, leading to selection bias should a researcher form a cohort exclusively of
patients who are identified as homeless in the EHR." You can really see the consequences of
this type of misclassification, or of these missing measurements, on the outcomes of the
research study itself: in this case, on the selection or non-selection of a patient into a
particular cohort, as well as on the broader conclusions that might be drawn from the
research.

In this next section of the lecture, I'm going to talk a bit about what it means to critically
appraise the validity of data. So, what does it mean for data or research to be valid? Starting
with the visual you see on the left here, I'm going to give it a go using the built-in Microsoft
PowerPoint recording laser pointer, so bear with me; I haven't tried this before.

We'll see if it works. You should be seeing me circle the visual I'm referring to. Starting with
the upper left-hand quadrant: this speaks to data which is unreliable and not valid.

The reason it's unreliable is that the data don't cluster around a particular point, so the
results are all over the place. And the reason it's considered invalid is that it's not targeting
the truth: it's not sitting at the center of the visual. Then we have the data in the upper right
quadrant, which is unreliable because, again, it's widely scattered; you can imagine these as
arrows flying at a target, and you're really missing your mark here. But it is valid, because
ultimately these data are centered on the truth, the targeted outcome.

Then in the bottom left quadrant, we have reliable data: tightly clustered, but far from the
center, so not valid. And in the final quadrant, on the bottom right, we have data which is
both reliable, because it's tightly clustered, and valid, because it's right on target. Let's see.

I'm going to try another maneuver here and circle the visual we're aiming for. This is really
our target: data which is reliable, tightly clustered, giving us a definitive result, and valid,
close to the truth.
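If you'd like a numerical version of that target diagram, here's a small sketch (my own
illustration, not from the article) that simulates the four quadrants: validity is how far the
average of the shots lands from the truth, reliability is how tightly the shots cluster.

```python
import numpy as np

rng = np.random.default_rng(7)
target = np.array([0.0, 0.0])  # the "truth" at the center of the target

# Four hypothetical measurement processes, one per quadrant of the visual:
# (bias away from the truth, spread of repeated measurements)
scenarios = {
    "unreliable, not valid": (2.0, 1.5),
    "unreliable, valid":     (0.0, 1.5),
    "reliable, not valid":   (2.0, 0.2),
    "reliable, valid":       (0.0, 0.2),
}

for name, (bias, spread) in scenarios.items():
    shots = rng.normal(loc=target + bias, scale=spread, size=(100, 2))
    mean_offset = np.linalg.norm(shots.mean(axis=0) - target)  # validity
    scatter = shots.std(axis=0).mean()                         # reliability
    print(f"{name:24s} mean-offset={mean_offset:.2f} spread={scatter:.2f}")
```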
Lie's article on evidence-based medicine theories, part three on appraising the evidence
("are the results valid and clinically important?"), gives a nice summary of some of the
questions that might be asked when critically appraising data validity.

What they say is that "a valid piece of research gives readers the confidence that its findings
are close to the truth. In critical appraisal, we essentially determine the risk of bias of a
study. The lower the risk of bias, the more likely the findings are to be valid."

I think this really speaks back to the G&G article, which shows the critical importance of
examining potential sources of bias inherent in the data that may impact the findings of a
study. So, to expand on the most recent case I was talking about, the relationship between
whether a patient is recorded as homeless and whether they are selected into a cohort on
the grounds of that record: imagine you ran that study on electronic health record data and
ultimately found no relationship between homelessness and the occurrence of a disease like
tuberculosis.

You drew that conclusion on the grounds that tuberculosis appeared in equal proportions in
the group recorded as homeless and the group not recorded as homeless. But the source of
all this electronic health record data was a clinic on the Downtown Eastside of Vancouver,
which, for those of you who are based in Vancouver or have visited, is home to one of the
largest homeless populations (language which itself deserves critical examination): many,
many homeless people reside in the Downtown Eastside.

So what may have happened is that a large number of homeless people, on the basis that
they were not specifically identified as homeless, were placed into the control population,
the "non-homeless" population. The equal occurrence of tuberculosis across those
populations is therefore not indicative of a relationship, or non-relationship, between
homelessness and tuberculosis occurrence. What that requires is a critical appraisal of the
dataset itself, asking whether it is sufficient for identifying patients as homeless or not
homeless, as well as a critical examination of the impact of potential (in this case, realized)
selection bias on the findings of the research.
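Here's a hypothetical simulation of exactly that failure mode, with invented rates:
homelessness genuinely raises tuberculosis risk in this toy model, but because it's only
documented for some homeless patients, the undocumented ones dilute the "control" group
and the measured association shrinks toward nothing.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000  # hypothetical clinic EHR

# Simulated truth: half the patients are homeless, and homelessness
# triples TB risk in this toy model.
homeless = rng.random(n) < 0.5
tb = rng.random(n) < np.where(homeless, 0.15, 0.05)

# Documentation gap: homelessness is only recorded for 40% of homeless
# patients; everyone without a note is treated as "not homeless".
documented = homeless & (rng.random(n) < 0.4)

print("True TB risk,      homeless vs not:",
      tb[homeless].mean().round(3), tb[~homeless].mean().round(3))
print("EHR-based TB risk, documented vs rest:",
      tb[documented].mean().round(3), tb[~documented].mean().round(3))
# The "control" group is heavily diluted with undocumented homeless
# patients, so the measured risk gap shrinks toward no association.
```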

The next step is to examine the consequences of those findings for broader and ongoing
research in the area. Does this impact policy? How does it affect future studies on the
linkages between homelessness and the occurrence of that disease? What this gets at is that
data validity is really about the correctness and relevance of the data, and it's so critical to
examine these things because doing so helps us prevent errors in decision-making.

And, as a final note here, there is a technical aspect to data validity worth mentioning, which
speaks to the broader theme of this section of the course: clean, clear, and confident
articulation of data findings using data visuals. Underlying clear and well-structured data
visuals are clean and well-structured datasets. If a dataset is not organized in such a way that
it's possible, for example, to scan through the data and find anomalies, things that fall
outside a plausible range, then you may be perpetuating invalid data.

For example, suppose you have a study of patients from 2020 and a poorly organized dataset,
and you don't catch that five of the patients are listed as being born before 1850 because
their birth years were incorrectly coded when their data was entered into the system. Let's
say these folks were actually born in 1950 and are being studied in 2020; that's not unusual.
But being born in 1850 would mean living around 170 years, so it's very unlikely those are
valid data points. And yet they could heavily skew conclusions based on that data.
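A minimal sketch of the kind of range check that catches this, using pandas (the column
names and the plausibility threshold are just assumptions for illustration):

```python
import pandas as pd

# Hypothetical 2020 study extract with a couple of miscoded birth years.
df = pd.DataFrame({
    "patient_id": [101, 102, 103, 104, 105],
    "birth_year": [1950, 1850, 1948, 1850, 1962],
})

STUDY_YEAR = 2020
MAX_PLAUSIBLE_AGE = 110  # anything older is flagged for review

df["age"] = STUDY_YEAR - df["birth_year"]
anomalies = df[df["age"] > MAX_PLAUSIBLE_AGE]

print(anomalies)
# The flagged rows (age 170) are almost certainly data-entry errors --
# e.g. 1850 keyed instead of 1950 -- and should be reviewed before analysis.
```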

So, for example, if you're looking at the relationship between age and frailty and you have
these wildly skewed data points because data was mis-entered, it may result in
overestimating or underestimating that relationship, depending on which way the data
swings. It's really in the best interest of any decision maker and any data analyst to have
clearly organized data, in order to properly identify anomalies and how they differ from
outliers. That's actually going to be the subject of a later course, where we talk about
outliers.

But for this week, I'm going to leave off here and say that the underlying theme of this
lecture session is that data validity is something worth deeply examining. There are a lot of
tools and techniques for doing that, and they can be very much dependent on the specific
data you're interacting with.

So the types of questions you might ask are very much dependent on the specific data you're
looking at. But one thing I would like to emphasize is that representativeness is particularly
important when you're looking at large datasets, or datasets which may have an impact on
decision-making. What that means is that the data should be reflective of the population for
which you are going to make the decision.
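One simple, concrete way to check this, sketched below with invented numbers, is to put the
sample's demographic breakdown side by side with census figures for the population the
decision will actually apply to, and look at the gaps:

```python
import pandas as pd

# Hypothetical age-group breakdowns: the EHR sample vs. census figures
# for the population the decision will apply to (proportions sum to 1).
population = pd.Series({"0-17": 0.19, "18-39": 0.31, "40-64": 0.32, "65+": 0.18})
ehr_sample = pd.Series({"0-17": 0.05, "18-39": 0.22, "40-64": 0.38, "65+": 0.35})

comparison = pd.DataFrame({
    "population": population,
    "ehr_sample": ehr_sample,
    "gap": ehr_sample - population,  # positive = over-represented in the EHR
})
print(comparison)
# Large gaps (here, children badly under-represented and seniors
# over-represented) warn that conclusions may not generalize.
```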

The absence of representative data, and the invalidity of data as a result, is the cause of a lot
of inappropriate or incorrect decision-making: for example, someone draws the incorrect
conclusion that because a sample size is large, it captures all segments of the population, and
that the conclusions of the study are therefore applicable to that entire population.

This is particularly challenging in health data because minority groups, Black, Indigenous, and
people of color, are often missed in patient samples for a whole host of reasons, including
accessibility of the healthcare system, or are over-represented in particular healthcare
studies because some underlying factor drives their participation. Similarly, women have
been missed in a lot of studies, which is one of the reasons we have a relatively poor
understanding of how, for example, attention deficit disorder presents in women: women
were largely excluded from studies of that condition. So it really can't be emphasized enough
that examining data validity, looking deeply at bias, and asking questions about who's
captured within the data, who might be missed, and how selection bias might be presenting
itself are all really critical to effective decision-making using data.

All right, so that's it for this section, and I'll see you over in the next lecture.
