Clinical Database Metadata Quality Control With SAS® and Python
ABSTRACT
A well-designed clinical database requires well-defined specifications for Case Report Forms (CRFs),
Field Attributes, and Data Dictionaries. The specifications are passed on to the Electronic Data Capture
(EDC) Programmers, who program the clinical database. How can a study team ensure that the source
specifications are complete, and the resulting clinical database metadata match the source
specifications? This paper presents two approaches. Initially, we used SAS® to read in the
specifications and clinical database metadata, and to provide comparison checks. Then, we converted
the project in Python in order to build a user-friendly tool that allows customers to run the checks
themselves. These reports have improved the quality of our clinical databases, as well as saved each
study team several hours of back and forth between the specification development and EDC
programming.
INTRODUCTION
There are many reasons to ensure consistent study database metadata, including:
• Study database documentation for validation
• Study build errors could affect data collection and/or analysis
• Implementing study build corrections takes time and effort
Our organization uses the Medidata RAVE EDC system to design, collect, and store our clinical study
databases. The study team collectively develops the study database through an iterative process. This
process includes two study metadata sources: the study team specifications defined in the Study Build
Specification (SBS) and the study build metadata in the Study Design Specification (SDS).
Converting the SBS and SDS documents to a database of metadata provides a source for quality control
reports, as well as additional tools to increase study database development efficiency.
EDC PROGRAMMING
The EDC programmer uses the SBS to program the Medidata RAVE study build. The specifications are
reflected in the eCRF appearance in RAVE, as well as the underlying database. Figure 2 is a screenshot
of the Demographics eCRF in RAVE for data collection.
Figure 2. Demographics eCRF in RAVE
STUDY DESIGN SPECIFICATIONS (STUDY BUILD METADATA)
Upon completion of the study build, the EDC programmer generates the Study Design Specifications
(SDS) from RAVE for documentation purposes. The SDS is the study build metadata in XML format
(readable in Excel), with all eCRF field metadata in the SDS Fields worksheet. Figure 4 shows the SDS
Fields worksheet with a selection of the Demographics field metadata.
Figure 4. SDS Fields worksheet with Demographics eCRF metadata
The study build development process is effective but leaves room for inconsistencies between the
source SBS and the eventual study build SDS. For example, the SBS may be missing key metadata, or
the SDS may reveal EDC programming errors. Fortunately, these two study metadata sources have many
elements that can be directly compared and are easily converted to a database.
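To make the comparison idea concrete, here is a minimal sketch (not the paper's actual implementation) of matching two such metadata sources with pandas. The column names Form, Field, Label, and PreText follow the examples in this paper, but the data values are invented:

```python
import pandas as pd

# Toy field metadata from each source; values are invented for illustration
sbs = pd.DataFrame({"Form": ["DM", "DM"], "Field": ["AGE", "SEX"],
                    "Label": ["Age", "Sex"]})
sds = pd.DataFrame({"Form": ["DM", "DM"], "Field": ["AGE", "SEX"],
                    "PreText": ["Age", "Sex at Birth"]})

# Outer merge on the form/field keys, then flag label discrepancies for a QC report
merged = sbs.merge(sds, on=["Form", "Field"], how="outer")
issues = merged[merged["Label"] != merged["PreText"]]
print(issues[["Form", "Field", "Label", "PreText"]])
```

A real check would compare many more attributes (type, length, code lists), but the merge-and-flag pattern is the same.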
SAS PILOT IMPLEMENTATION – CONVERTING SBS AND SDS TO SAS DATA
SETS
Converting the source SBS and SDS files is straightforward with some basic SAS
procedures and programming methods.
For the SBS, the SAS data set of the individual eCRF contains the worksheet columns as variables,
which are named, typed, and formatted accordingly in a subsequent DATA step.
The SDS workbook structure is more straightforward. It contains the FIELDS worksheet of all eCRF
metadata and is converted to a SAS data set similarly using PROC IMPORT:
filename SDSFILE "/devel/opsprog/sbs_sds/pharmasug2020/SDSexample.xlsx"
    encoding="utf-8";
The LIBNAME references the full path and filename of the SBS file, using the XLSX engine. The
worksheet names are saved as SAS data set SBS_SHEETS from SAS view DICTIONARY.TABLES.
PROC SQL is then used to create macro variables for a delimited list of eCRF worksheet names and the
total number of eCRF worksheets from the SBS_SHEETS data set (excluding worksheets that are not
related to eCRFs):
proc sql noprint;
    select memname
        into :tab_list separated by '|'
        from sbs_sheets
        where memname not in ("CRFS", "FOLDERS", "DICTIONARY")
    ;
    select count(memname)
        into :tab_count
        from sbs_sheets
        where memname not in ("CRFS", "FOLDERS", "DICTIONARY")
    ;
quit;
%end;
%mend convert_sheet_to_dataset;
e. Log Field indicator
f. SAS Label
g. Required field indicator
h. Review groups
Quality Control reports of flagged issues are provided to the Clinical Data Manager (CDM) of the study
team. Figure 6 shows a sample spreadsheet of field label discrepancies between the SBS (field label)
and SDS (PreText) by eCRF and field.
Figure 6. Quality Control Report Example
friendly user interface that enabled entry validation. Python, with its myriad of ready-to-use modules, was
a handy candidate.
The first three inputs (the protocol number, the SBS file, and the output directory) are required. The SDS file is
optional, depending on which report the user needs generated. The user then clicks the 'Start'
button in the lower right corner to initiate the report. Shown below is a snapshot during a typical run
process. From this point until the program finishes executing, the interface takes over as the standard
output, with all print statements logged through it. Thus, via carefully selected print statements, the
developer can allow the user to view/monitor progress as the code runs.
Finally, each process generates a report in Excel that is opened as soon as it is created, which brings it to
the user’s immediate attention.
The program is organized as two main functions. The main function imports the second and initiates the user
interface when called (see the third-from-last statement in Figure 9 above). This paper will focus on
four elements of the implementation:
• reading in pertinent Excel files through Python,
• allowing for code flexibility,
• an overview of setting up the user interface, and
• creating an executable that can be made available to users.
Admittedly, data transformation is the largest component of this project. It is briefly addressed in
the Appendix with useful resources cited, but it is a whole topic on its own with books written about
it. A particularly good reference is Wes McKinney's Python for Data Analysis 1.
Reading in the CRFs is an extension of the above. However, because most of the tabs are CRFs, reading
all of them at once is easier by first identifying and excluding the few tabs that are not CRFs. First, we
collect all the tabs in the SBS Excel document into a list. Then we filter that list by removing non-CRF
tabs. What remains will be SBS CRF tabs, which we can then iterate over:
sbs_tabs = [sheet_name for sheet_name in SBS_xl.sheet_names]
CRF_tabs = [
    tab
    for tab in sbs_tabs
    if tab.upper()
    not in (
        "CRFS",
        "FOLDERS",
        "DICTIONARY",
    )
]
The two lists above were created via a Python list comprehension. A list comprehension is a convenient
way to create a list with one code statement. The iteration is performed below. In it, each CRF tab is read
and appended to an initially empty list (df_list). As each tab is processed, a CRF count variable keeps
track of the number of CRFs. During processing, the output of the print statement is displayed through the
GUI, alerting the user to the progress, as shown in Figure 8 above. Upon completion, the final
number of CRFs iterated over is also displayed.
crf_count = 0
df_list = []
for tab in CRF_tabs:
    crf_count += 1
    tab_name = tab
    df = pd.read_excel(
        SBSFile,
        sheet_name=tab,
        header=0,
    )
    print(f"SBS-SDS check - Processing CRF {tab_name}... ")
    df["Source_Tab"] = "{}".format(tab_name)
    df_list.append(df)
For illustration, two ways to handle Python strings are presented. The print statement uses the more
modern approach, known as f-strings: to combine Python variables with strings, the desired string is
placed in f-prefixed quotation marks (single or double) with the Python variables placed in curly
brackets. An alternative, older approach places the desired string in quotes, with variable positions
represented by empty sets of curly brackets; the variable names are then passed, in order, as
arguments to an attached format() call. Where possible, f-strings are preferred, as they require less
typing and are easier to read. The one caveat is that f-strings are available only in Python versions 3.6 and
later.
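As a quick side-by-side (the variable names here are invented for illustration), both approaches produce the same string:

```python
crf = "Demographics"  # hypothetical CRF tab name
count = 3

# Modern f-string (Python 3.6+): variables go inside curly brackets
msg_new = f"Processing CRF {crf} ({count} so far)"

# Older approach: empty curly brackets filled positionally by .format()
msg_old = "Processing CRF {} ({} so far)".format(crf, count)

print(msg_new == msg_old)  # -> True
```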
The final step creates a data frame from the df_list object. It uses pandas' concat function, which takes a
sequence (such as a list) or a mapping (such as a dictionary) of pandas objects and returns a single
combined data frame:
sbs_variables = pd.concat(df_list, sort=False, ignore_index=True)
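A toy example of the same call, with invented two-column frames standing in for the per-CRF data frames in df_list:

```python
import pandas as pd

# Two stand-in per-CRF frames, mimicking df_list entries
df1 = pd.DataFrame({"Field": ["AGE", "SEX"], "Source_Tab": ["DM", "DM"]})
df2 = pd.DataFrame({"Field": ["HEIGHT"], "Source_Tab": ["VS"]})

# concat stacks the frames row-wise; ignore_index renumbers rows 0..n-1
sbs_variables = pd.concat([df1, df2], sort=False, ignore_index=True)
print(len(sbs_variables))  # -> 3
```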
In an ideal world, the Excel files to be parsed would be created in a standard way with static metadata
elements. In the real world, data-generating processes evolve. Python offers a construct that tolerates
some amount of variation: the try/except statement. Below is an example of a function used in the tool
to 'clean out' Excel cell contents. It takes a given string, removes or compresses out specified unwanted
characters, and returns either a stripped string void of those characters, or a number if only digits
remain after compressing. An example use is finding format size: in the SBS document, character field
variable formats use a dollar-sign prefix, and the dollar signs can be ignored when extracting
variable width.
def ncompress(string, chars):
    """Return string with chars removed, as a number if only digits remain."""
    want = "".join(
        [char for char in string if char not in chars and char is not None]
    )
    try:
        return float(want)
    except ValueError:
        return want.strip()

# e.g., ncompress("$25", "$") returns 25.0; ncompress("Age:", ":") returns "Age"
The code uses the most basic form of the Python try/except construct, where failures in the
try block are handled by their type in the except block.
Another example is given below. At the end of the run, the program immediately opens the generated
Excel report. If this fails for any reason, a note is written to the user interface telling the user where
to find the report, along with the error. It is possible to have multiple except statements; each exception
handler before the last must specify the type of error it handles, such as ValueError above. It is also
possible to use nested try/except blocks.
try:
    print("Opening up SBS/SDS Report...")
    os.system(f'start "excel" "{rpt_workbook}"')
except Exception as e:
    print(f"Workbook can be found in location: {rpt_workbook}")
    print(f"Error: {e}")
Setting up the user interface using the Gooey module2 is relatively straightforward. Gooey also produces
interfaces with a more modern feel than other comparable Python packages. One approach is to make
the main calls into Python functions, with the required inputs – the SBS Excel file, the study name and the
output destination – as arguments. Gooey has various widgets available for each input type. For example,
text input is handled by the “TextField” widget. File selection, which allows browsing to a required file, is
made possible through the “FileChooser” widget. Selection of an output directory is facilitated by the
"DirChooser" widget. These are convenient as they minimize user input and errors. In addition, for
some of these widgets, validation is available. For example, if only more modern versions of Excel are
acceptable as file inputs, one can include validation for that widget, so the program does not proceed if
supplied a file with the wrong extension. See the Appendix for a validation example.
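The validator's test entry is a Python expression evaluated with the widget's value bound to user_input. In plain Python, the same check looks like this (the function name is ours, for illustration):

```python
# Same check as the Gooey validator expression, as an ordinary function
def valid_xlsx(user_input):
    return user_input.endswith("xlsx")

print(valid_xlsx("SBSexample.xlsx"))  # -> True
print(valid_xlsx("SBSexample.xls"))   # -> False
```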
@Gooey(program_name="Create SBS/SDS Report", advanced=True)
def parse_args():
    stored_args = {}
    script_name = os.path.splitext(os.path.basename(__file__))[0]
    args_file = f"{script_name}-args.json"
    if os.path.isfile(args_file):
        with open(args_file) as data_file:
            stored_args = json.load(data_file)
    parser = GooeyParser(description="Create SBS/SDS Report")
    parser.add_argument(
        "protocol_name",
        action="store",
        default=stored_args.get("protocol_name"),
        widget="TextField",
        help="Protocol Name - e.g., HVTN 043. Used in output report name "
        "and report title.",
    )
The widgets described above are configured through parameters of the decorated parse_args
function. A decorator is a Python function that takes another function as input and returns a modified
version of it. While the details are beyond this paper's scope, it suffices to know that we need
the @Gooey decorator when extracting the arguments in the parse_args function, as shown above.
Inside the function, we create a GooeyParser object, to which the expected arguments are added. The
code above shows how we would add the TextField widget, displayed as protocol_number in Figure 7.
See the Appendix for details on adding the three other widgets.
    args = parser.parse_args()
    with open(args_file, "w") as data_file:
        json.dump(vars(args), data_file)
    return args


if __name__ == "__main__":
    conf = parse_args()
    print("Output Directory:", conf.output_directory)
    print("SBS File:", conf.sbs_file)
    print("SDS File:", conf.sds_file)
    print("Protocol Number:", conf.protocol_name)
    if conf.sds_file is not None:
        print("Running SBS/SDS Compare since both SBS and SDS were supplied...")
        process_data(
            conf.sbs_file, conf.sds_file, conf.output_directory, conf.protocol_name
        )
    else:
        print()
        print("Running internal SBS review since no SDS supplied...")
        internal_review(conf.sbs_file, conf.output_directory, conf.protocol_name)
We end by storing the inputs to a JSON file so that, in subsequent runs of the code, the last entered
inputs are cached and offered as defaults.
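The caching pattern itself is plain standard-library Python; a self-contained sketch (the file name and values are invented):

```python
import json
import os
import tempfile

# Hypothetical cache path; the tool derives it from the script name instead
args_file = os.path.join(tempfile.gettempdir(), "demo-args.json")

# End of one run: persist the inputs
with open(args_file, "w") as f:
    json.dump({"protocol_name": "HVTN 043"}, f)

# Start of the next run: reload the inputs as defaults if the cache exists
stored = {}
if os.path.isfile(args_file):
    with open(args_file) as f:
        stored = json.load(f)

os.remove(args_file)  # cleanup for this demo
print(stored.get("protocol_name"))  # -> HVTN 043
```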
Finally, the line if __name__ == "__main__": indicates that we would like to run the code that follows
only when the script is executed directly. Otherwise, if the code in the script is imported while running
another module, it is not the main script but a dependency of it, and the guarded code
is not run.
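A small demonstration of the mechanism: inside an imported module, __name__ holds the module's own name rather than "__main__" (the module here is constructed in memory just for illustration):

```python
import types

# Build a throwaway module object named "helper" and run code inside it
mod = types.ModuleType("helper")
exec('result = "imported" if __name__ != "__main__" else "script"', mod.__dict__)
print(mod.result)  # -> imported
```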
any. Ideally, users then address the issues raised and regenerate the report until no more problems are
cited.
DEPLOYMENT
Once completed, the tool can be deployed to users in one step using the pyinstaller module. After
installing the module via Python's pip install, one only needs to open a command prompt, navigate to
the directory containing the script to be converted into an executable, and issue the following command:
pyinstaller <python script name>.py --onefile
The --onefile option packages all dependencies into a single file, which is more portable. You
can then provide users a link to the executable from a website or SharePoint site, for example.
CONCLUSION
Using SAS and Python, Clinical Programming has created tools to extract and analyze metadata from
study database specifications to generate quality control reports. With the aid of various Python modules,
our Clinical Data Managers can now run the interactive tool and generate the reports independently, as
needed. The implementation of these tools into the study database development and validation process
has saved time and helped ensure a high-caliber study database for data collection and analyses.
REFERENCES
1. McKinney, Wes. 2018. Python for Data Analysis, 2nd Edition. Sebastopol, CA: O'Reilly Media.
2. Kiehl, Chris. 2020. Gooey v1.0.3. https://github.com/chriskiehl/Gooey
ACKNOWLEDGMENTS
The authors would like to thank their colleagues at Fred Hutch/SCHARP for their inspiration, support, and
feedback.
RECOMMENDED READING
• Python for Data Analysis
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Craig Chin
SCHARP / Fred Hutch
craig@fredhutch.org
Lawrence Madziwa
SCHARP / Fred Hutch
lmadziwa@fredhutch.org
Any brand and product names are trademarks of their respective companies.
APPENDIX
Below is code used to create the user interface using the Gooey module:
@Gooey(program_name="Create SBS/SDS Report", advanced=True)
def parse_args():
    stored_args = {}
    script_name = os.path.splitext(os.path.basename(__file__))[0]
    args_file = f"{script_name}-args.json"
    if os.path.isfile(args_file):
        with open(args_file) as data_file:
            stored_args = json.load(data_file)
    parser = GooeyParser(description="Create SBS/SDS Report")
    parser.add_argument(
        "protocol_name",
        action="store",
        default=stored_args.get("protocol_name"),
        widget="TextField",
        help="Protocol Name - e.g., HVTN 043. Used in output report name "
        "and report title.",
    )
    parser.add_argument(
        "sbs_file",
        action="store",
        default=stored_args.get("sbs_file"),
        widget="FileChooser",
        help="Study Build Specification. This Excel (xlsx) document is "
        "created by Clinical Data Managers to define "
        "the database to be programmed in RAVE. For Global Library "
        "Build Spec, click on Help Tab on top left",
        gooey_options={
            "validator": {
                "test": "user_input.endswith('xlsx')",
                "message": "SBS file must be an Excel file with the .xlsx "
                "extension.",
            }
        },
    )
    parser.add_argument(
        "-sds_file",
        widget="FileChooser",
        help="Study Database Specification. This describes the database "
        "as-built. It is exported from RAVE. See Help "
        "Tab (top left) for instructions for how to export the current "
        "SDS from RAVE. Must be .xlsx file.",
        gooey_options={
            "validator": {
                "test": "user_input.endswith('xlsx')",
                "message": "SDS file must be an Excel file with the .xlsx "
                "extension.",
            }
        },
    )
    parser.add_argument(
        "output_directory",
        action="store",
        default=stored_args.get("output_directory"),
        widget="DirChooser",
        help="Output directory location to save the generated reports.",
    )
    args = parser.parse_args()
    with open(args_file, "w") as data_file:
        json.dump(vars(args), data_file)
    return args


if __name__ == "__main__":
    conf = parse_args()
    print("Output Directory:", conf.output_directory)
    print("SBS File:", conf.sbs_file)
    print("SDS File:", conf.sds_file)
    print("Protocol Number:", conf.protocol_name)
    if conf.sds_file is not None:
        print("Running SBS/SDS Compare since both SBS and SDS were supplied...")
        process_data(
            conf.sbs_file, conf.sds_file, conf.output_directory, conf.protocol_name
        )
    else:
        print()
        print("Running internal SBS review since no SDS supplied...")
        internal_review(conf.sbs_file, conf.output_directory, conf.protocol_name)