
A hit-miss model for duplicate detection in the WHO drug safety database

Published: 21 August 2005

Abstract

The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction incidents that occur after drugs are introduced on the market. As in other post-marketing drug safety data sets, the presence of duplicate records is an important data quality problem, and the detection of duplicates in the WHO drug safety database remains a formidable challenge, especially since the reports are anonymised before being submitted to the database. However, to our knowledge no work has been published on methods for duplicate detection in post-marketing drug safety data. In this paper, we propose a method for probabilistic duplicate detection based on the hit-miss model for statistical record linkage described by Copas & Hilton. We present two new generalisations of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the effectiveness of the hit-miss model for duplicate detection in the WHO drug safety database both at identifying the most likely duplicate for a given record (94.7% accuracy) and at discriminating duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other applications throughout the KDD community.
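As a rough illustration of the scoring idea behind a hit-miss model for a single categorical field, the sketch below computes log-likelihood-ratio weights for a pair of records. It is a minimal sketch under stated assumptions, not the paper's implementation: each non-blank value of a true duplicate is assumed to be either a "hit" (the true value) or a "miss" (a random draw from the field's value distribution), blanks are assumed to contribute zero weight, and fields are assumed independent (the paper's mixture model for numerical fields and its handling of correlated fields are not covered). The names `miss_factor`, `field_weight`, `pair_score`, and the parameters `a` (miss probability), `b` (blank probability), and `beta` (relative value frequencies) are illustrative choices.

```python
import math


def miss_factor(a: float, b: float) -> float:
    """Probability that at least one of two non-blank values is a miss.

    Given non-blank, a value is a hit with probability h = (1 - a - b) / (1 - b),
    so the factor is c = 1 - h**2 = a * (2 - a - 2b) / (1 - b)**2.
    """
    return a * (2.0 - a - 2.0 * b) / (1.0 - b) ** 2


def field_weight(x, y, beta: dict, c: float) -> float:
    """Log2 likelihood ratio (duplicate vs. random pair) for one field.

    x, y : observed values, or None for a blank (assumed to carry no evidence)
    beta : hypothetical mapping from value to its relative frequency
    """
    if x is None or y is None:
        return 0.0
    if x == y:
        # P(j, j | duplicate) = beta_j * (1 - c * (1 - beta_j)); P(j, j | random) = beta_j**2
        return math.log2(1.0 - c * (1.0 - beta[x])) - math.log2(beta[x])
    # P(j, k | duplicate) = c * beta_j * beta_k; P(j, k | random) = beta_j * beta_k
    return math.log2(c)


def pair_score(rec_x: dict, rec_y: dict, freqs: dict, params: dict) -> float:
    """Sum field weights over all fields, assuming the fields are independent."""
    total = 0.0
    for field, beta in freqs.items():
        a, b = params[field]
        total += field_weight(rec_x.get(field), rec_y.get(field), beta, miss_factor(a, b))
    return total


# Hypothetical field frequencies and error rates, for illustration only.
freqs = {"sex": {"F": 0.5, "M": 0.5}, "country": {"SE": 0.01, "US": 0.99}}
params = {"sex": (0.05, 0.10), "country": (0.02, 0.05)}
score = pair_score({"sex": "F", "country": "SE"}, {"sex": "F", "country": "SE"}, freqs, params)
```

Under these assumptions, agreement on a rare value (here the hypothetical `"SE"`) earns a larger weight than agreement on a common one, and any disagreement earns the same negative weight `log2(c)`.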

References

[1] A. Bate, M. Lindquist, I. R. Edwards, S. Olsson, R. Orre, A. Lansner, and R. M. De Freitas. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology, 54:315--321, 1998.
[2] T. Belin and D. Rubin. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90:694--707, 1995.
[3] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39--48. ACM Press, 2003.
[4] M. Bilenko and R. J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 workshop on data cleaning, record linkage and object consolidation, pages 7--12, 2003.
[5] E. A. Bortnichak, R. P. Wise, M. E. Salive, and H. H. Tilson. Proactive safety surveillance. Pharmacoepidemiology and Drug Safety, 10:191--196, 2001.
[6] A. D. Brinker and J. Beitz. Spontaneous reports of thrombocytopenia in association with quinine: clinical attributes and timing related to regulatory action. American Journal of Hematology, 70:313--317, 2002.
[7] J. Copas and F. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 153(3):287--320, 1990.
[8] I. R. Edwards. Adverse drug reactions: finding the needle in the haystack. British Medical Journal, 315(7107):500, 1997.
[9] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
[10] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 127--138. ACM Press, 1995.
[11] M. Lindquist. Data quality management in pharmacovigilance. Drug Safety, 27(12):857--870, 2004.
[12] A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Research Issues on Data Mining and Knowledge Discovery, 1997.
[13] H. B. Newcombe. Record linkage: the design of efficient systems for linking records into individual family histories. American Journal of Human Genetics, 19:335--359, 1967.
[14] J. N. Nkanza and W. Walop. Vaccine associated adverse event surveillance (VAEES) and quality assurance. Drug Safety, 27:951--952, 2004.
[15] R. Orre, A. Lansner, A. Bate, and M. Lindquist. Bayesian neural networks with confidence estimations applied to data mining. Computational Statistics & Data Analysis, 34:473--493, 2000.
[16] M. D. Rawlins. Spontaneous reporting of adverse drug reactions. II: Uses. British Journal of Clinical Pharmacology, 1(26):7--11, 1988.
[17] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269--278. ACM Press, 2002.

Cited By

  • (2023) Analyses of Dupilumab-Related Ocular Adverse Drug Reactions Using the WHO’s VigiBase. Advances in Therapy, 40:9 (3830-3856). DOI: 10.1007/s12325-023-02573-3. Online publication date: 26-Jun-2023
  • (2023) From Analogue to AI. Ralph Edwards: RARE EVENTS (283-297). DOI: 10.1007/978-3-031-14981-8_16. Online publication date: 21-Jan-2023
  • (2020) The safety and feasibility of a new rehabilitation robotic exoskeleton for assisting individuals with lower extremity motor complete lesions following spinal cord injury (SCI): an observational study. Spinal Cord. DOI: 10.1038/s41393-020-0423-9. Online publication date: 7-Feb-2020

Reviews

John A. Fulcher

Data cleaning is an essential first step in the knowledge discovery in databases (KDD) process. Apart from the removal of noise, another critical preprocessing task is the removal of duplicate records from the databases in question. The application of interest to the authors is drug safety, although the techniques they describe have wider applicability.

Norén and coauthors use Copas and Hilton's hit-miss model [1] for statistical record linkage within the World Health Organization's (WHO's) drug safety database. They note in passing that most of the parameters needed for this model are determined by the entire data set, which reduces the risk of overfitting. Moreover, they found that adding the following features improved the performance of the standard hit-miss model: modeling errors in numerical record fields, and incorporating a computationally efficient method of handling correlated record fields.

A total of 38 groups of duplicate records had been previously (manually) identified in the WHO drug safety database. The authors' modified hit-miss model was applied retrospectively to this database. This led, first, to the identification of the most likely duplicates for a given record (with 94.7 percent accuracy), and, second, to discriminating duplicates from random matches (with 63 percent recall and 71 percent precision). In short, they claim to be able to detect a "significant proportion of duplicates without generating many false leads."

The authors plan to perform a prospective study at some point in the future, using their modified hit-miss model to highlight suspected duplicates in an unlabeled data subset, following up their results with a manual review. This paper will appeal to researchers with an interest in KDD, especially in preprocessing in general, and in duplicate record elimination in particular.

Online Computing Reviews Service
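To make the recall and precision figures quoted above concrete, the following minimal sketch shows how the two measures are computed once every candidate record pair has a match score and a manually assigned label. The function name `precision_recall`, the threshold, and the toy scores and labels are hypothetical and are not taken from the paper.

```python
def precision_recall(scores, is_duplicate, threshold):
    """Precision and recall when pairs scoring at or above `threshold` are flagged.

    scores       : match scores, one per candidate record pair
    is_duplicate : manually assigned labels (True for a confirmed duplicate pair)
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, d in zip(flagged, is_duplicate) if f and d)      # flagged, truly duplicate
    fp = sum(1 for f, d in zip(flagged, is_duplicate) if f and not d)  # flagged, not duplicate
    fn = sum(1 for f, d in zip(flagged, is_duplicate) if not f and d)  # duplicate, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, for illustration only.
toy_scores = [9.0, 7.0, 5.0, 3.0, 1.0]
toy_labels = [True, True, False, True, False]
precision, recall = precision_recall(toy_scores, toy_labels, threshold=4.0)
```

Raising the threshold generally trades recall for precision, which is why the two figures are reported together.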


Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN:159593135X
DOI:10.1145/1081870
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. duplicate detection
  2. hit-miss model
  3. mixture models

Qualifiers

  • Article

Conference

KDD05

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Citations

  • (2019) Androgenic Effects on Ventricular Repolarization. Circulation, 140:13 (1070-1080). DOI: 10.1161/CIRCULATIONAHA.119.040162. Online publication date: 24-Sep-2019
  • (2019) Drug-induced systemic lupus: revisiting the ever-changing spectrum of the disease using the WHO pharmacovigilance database. Annals of the Rheumatic Diseases, 78:4 (504-508). DOI: 10.1136/annrheumdis-2018-214598. Online publication date: 4-Feb-2019
  • (2017) RISE: Resolution of Identity Through Similarity Establishment on Unstructured Job Descriptions. Service-Oriented Computing (19-36). DOI: 10.1007/978-3-319-69035-3_2. Online publication date: 18-Oct-2017
  • (2017) Automated Product-Attribute Mapping. Trends and Applications in Knowledge Discovery and Data Mining (163-175). DOI: 10.1007/978-3-319-67274-8_15. Online publication date: 7-Oct-2017
  • (2016) References. Methodological Developments in Data Linkage (233-252). DOI: 10.1002/9781119072454.refs. Online publication date: 5-Feb-2016
  • (2015) Text and Data Mining Techniques in Adverse Drug Reaction Detection. ACM Computing Surveys, 47:4 (1-39). DOI: 10.1145/2719920. Online publication date: 11-May-2015
  • (2014) Performance of Probabilistic Method to Detect Duplicate Individual Case Safety Reports. Drug Safety, 37:4 (249-258). DOI: 10.1007/s40264-014-0146-y. Online publication date: 14-Mar-2014
