Suggestions for a Web based universal exchange and inference language for medicine

Thomas Caruso; Barry Robson

Suggestions for a Web Based Universal Exchange and Inference Language for Medicine. Continuity of Patient Care with PCAST Disaggregation.

Barry Robson*, Thomas P. Caruso and Ulysses G. J. Balis

Quantal Semantics Inc, Virginia, US. and also *St. Matthew's University School of Medicine, Grand Cayman; *The Dirac Foundation, UK; *University of Wisconsin-Stout, US; University of North Carolina, and University of Michigan, Michigan, US. Tel: (00)1-345-3199 x 193; Fax: (001)1-345-945-3130; robsonb@aol.com

We describe here the applications of our recently proposed Q-UEL language to continuity of patient care between physicians, specialists and institutions as mediated via the Internet, giving examples derived from HL7 CDA and VistA of particular interest to workflow. Particular attention is given to the Universal Exchange Language for healthcare as requested by the US President's Council of Advisors on Science and Technology (PCAST) released in December 2010, especially in regard to disaggregation of the patient record on the Internet. To illustrate many features and options, one of our most elaborate configurations combining them, for disaggregation and reaggregation, is described. The Q-UEL tags used do not physically join, but query each other from a random mix via the application. Despite the computationally demanding complexity of the configuration with two joining tags for each data tag and four independently evolving keys, plus a valuable but rate limiting isomorphism test, packets of essential clinical data for a patient could be recovered and displayed every 2 seconds for a “club” of 30,000-50,000 patients in the mix. All computation here is on a standard laptop, but for practical use of the Internet to display downloaded data, the above is adequate, so focus is primarily on increasing club size. In practice, it is not necessary that a club comprise an entire nation.
Assuming that one does not use purely random assignments of patients to arbitrary clubs, there could for example be a club comprising all schoolchildren in Scotland, or a club comprising all military veterans in Illinois. In such cases, one is typically dealing with clubs each of the order of a mere million patients. Using such club sizes efficiently, and in principle even a club the size of a whole country, appears to be possible.

Keywords: Universal Exchange Language, PCAST report, Continuity of Care, Interoperability, Electronic Health Record, Disaggregation

1. Introduction and Review.

1.1. Background

We recently made suggestions for a web-based universal exchange and inference language (Q-UEL) for medicine [1], based on generating medical knowledge by data mining many patient records (e.g. ref [2]) and authoritative medical text using XML-like tags as artifacts (footnote 1). However, we see many of the same considerations in the continuity of care (COC) for a patient, where the most important artifact is just one patient's electronic health record (EHR), or a subset of information on it, exchanged between stakeholders (healthcare providers, authorized players) such as the patient, physician and pharmacist. Our particular use of the term COC in this report applies when stakeholders are in different institutions, networks of care such as accountable care organizations [3], and even different countries. We focus here on COC seen as a topic within the domain of health information exchange (HIE), so that the present report is primarily directed to workers in that computational and communications field. However, it has been pointed out to us that several aspects will be of direct interest to stakeholders such as physicians. For that reason, sections and major subsections tend to start with an overview of the broader significance of what follows, when it is of a very technical nature.
Some aspects that may have more direct impact on healthcare stakeholders are best described later below (Section 1.13) after some review and explanatory discussion. The main motivation for this report remains as follows. The HIE field is addressing matters that go beyond sending medical information via a fax machine with humans doing all the information processing. However, in the absence of strong consideration of COC so far, there has been no strong selective pressure to inhibit appearance of multiple standards, divergent evolution of standards, and variation in use. This presents an interoperability challenge with opportunities for basic research in the computational sciences, and researchers interested in several disciplines [4-10] (footnote 2) may also be interested in the present report. It is not least a challenge as COC is envisioned in the following US Federal report.

Footnote 1. “Artifact” (or “exchange artifact” or “communication artifact”) is a term increasingly used in the emerging medical IT interoperability field, and especially the continuity of care field, for any construction that carries and represents a packet of transmitted medical information. Traditionally, for data from one patient, it is a transmission in some kind of specialist messaging language or an XML document. XML documents contain tags describing, containing, and delimiting text as information. Such tags comprise strings within angular brackets <…>. Q-UEL also uses these, though also because, by a remarkable coincidence, they form the basis for packets of information in a notation used in physics [1], as also described below.

Footnote 2. HIE between systems with differing vocabularies, ontological structures and worldviews presents significant semantic challenges. “Big data” mining remains directly important for assessing the quality of COC [4]. Selecting the best diagnosis and therapy for a patient is the primary use case of COC, and clinical decision support systems (CDSS) use challenging and usually probabilistic concepts that along with semantics relate to artificial intelligence (e.g. Refs. [5-8]). Our Q-UEL approach [1,8] is rooted in quantum mechanics (QM) [9,10]. Not least, information exchange in COC increases the demand for innovation in security and privacy, one of several controversial aspects discussed in Section 1.2.

1.2. The PCAST Report.

Recent US government emphasis has been on ensuring that diverse EHR implementations have at least common required capabilities called meaningful use [11], but this seems a considerable compromise after the more radical proposals in a controversial report by the President's Council of Advisors on Science and Technology (PCAST) in December 2010 [12]. It was concerned with the diversity of entrenched standards for representation of the EHR and expressly proposed a single universal exchange language (UEL) for healthcare. PCAST proposed that “…existing standards groups would publish mappings of existing vocabularies and content standards … into the adopted markup language. This straightforward step immediately expands the semantically meaningful realm of tagged data exchanges to include data that are coded in these existing standards.” PCAST did not require that the UEL was expressed in any existing standard, speaking of UEL as “XML-like”, and stating “We believe that the natural syntax for such a universal exchange language will be some kind of extensible markup language (an XML variant, for example) capable of exchanging data from an unspecified number of (not necessarily harmonized) semantic realms. Such languages are structured as individual data elements, together with metadata that provide an annotation for each data element” (our italics). A strong feature of the PCAST proposal was the desire for a UEL to make the patient record available anywhere, anytime, via the Internet, to authorized persons for the benefit of the patient. To that end they proposed a mechanism for added security, privacy and granularity as disaggregation “of complex records into the smallest possible data elements”.

1.3. Activity since PCAST: Reactions of Established Standards Organizations.

PCAST asserted that developing a UEL “incorporates these standards into the new architecture, leveraging the work done by thousands of people for decades”, but the implications are controversial. Once a sufficiently powerful UEL is consolidated and stable, there seem no overwhelming reasons why EHRs and messaging artifacts should not stay in a UEL form, effectively displacing the standards, except that whole communities and many applications are built around each standard. However, PCAST was suggesting that a UEL may be the route of overall least effort as a universal second language. Q-UEL positions itself that way. Not least, bidirectional communication between N distinct standards and implementations of them requires developing 2N conversion procedures if a UEL is used as a single hub language, but N(N – 1) are required if a hub language is not used (for example, 20 rather than 90 when N = 10). Despite that, standards bodies still appear to feel that replacement by a UEL remains a possibility and threatens their authority over their domains. Certainly it is the case that the major standards bodies are placing their efforts on extending their standards to provide COC and strengthening interoperability between centers of care and variants of their standard (Sections 1.8, 1.9), as well as on satisfying meaningful use [11].

1.4. Activity Since PCAST: A Healthcare Role for the Semantic Web.

The Semantic Web (SW) [13-15] seeks to go beyond the current web that links web pages by linking all data and knowledge through a hub of common meaning, accessed by URLs (links), i.e. by the RDF method [14].
PCAST also stated, “The physician would be able to securely search for, retrieve, and display these privacy-protected data elements in much the way that web surfers retrieve results from a search engine when they type in a simple query.” Consequently, we argued [1] that development of the medical SW would best satisfy the PCAST proposal. It would also enable clinical decision support systems (CDSS) that have been developing for many years, though traditionally as “off line” expert systems using human experts to type in rules held locally [16]. It would allow such knowledge to be pooled and shared as it also satisfied growing demands for patient health information integration at a more nationwide level [17]. It would ideally need to be a broader WW4, a Thinking Web based on probabilistic semantics [18]. There is an essential absence of even basic features of the SW and its RDF based approach [15] in the current EHR and HIE standards, so our proposal of Q-UEL was at that time unusual, but in 2013 the Yosemite Manifesto [19] proposed that “electronic healthcare information should be exchanged in a format that either: (a) is an RDF format directly; or (b) has a standard mapping to RDF”. We signed this because Q-UEL has such a mapping.

1.5. Other Efforts After or Relevant to PCAST.

The first steps of development of Q-UEL as a PCAST-like UEL solution had rather little to draw on except the preexisting standards and the PCAST report itself. Growing calls for a universal EHR prior to PCAST (e.g. Ref. [17]) were still very recent. Probabilistic semantics is still not a settled discipline (e.g. Ref. [18]), and the particular probabilistic theory used in Q-UEL only goes back to 2007 [20], albeit based directly on mathematics developed by Dirac in the 1920s and 1930s [9,10]. There are, however, several recent efforts in COC that come close to a UEL in their potential effect.
For example, EPIC is very active [6], and in a sense does pursue the position of the lingua franca of EHRs as well as the end point for all health information. The Health Record Bank Alliance [21] promotes another solution which allows health record archives to aggregate data for an individual. There is much effort that reflects the trend in meaningful use [11] to allow patients and health-conscious individuals variously to access, interact with, generate, or control use of their data (footnote 3). The relevance here is that such efforts are increasingly making use of JSON (JavaScript Object Notation) [22], in some sense a UEL as discussed in Section 1.12. It probably only comes closest to being a web-based UEL, however, in still rather special forms like JSON-LD [23], which link web data. There are recent efforts to develop extensions to existing EHR standards for COC and interoperability [24, 25], but they are not truly a UEL because focus is on communication within a standard and between variants of it. Statistical overview of the quality of future COC is seen as very important, as is the quality of the clinical data itself (e.g., the National Quality Forum and the Quality Data model [26]). There is growing interest in data mining HIE [4] and in web based probabilistic methods (e.g. Refs. [27-29]). Q-UEL owes a considerable debt to many efforts concerning biomedical and general biological information and the SW [31-52], and will benefit from relevant advances in high performance “big data processing” (e.g. ref [53]), though it is notable that this large body of work rarely touches upon probability.

1.6. Data Quality Representation and Controversial Probabilistic Aspects.

Probability does not immediately spring to mind as an issue for patient records. PCAST did not address it, except implicitly in regard to data mining, and most stakeholders take patient records as fact, albeit with a cautious eye for any possible irregularities.
However, Q-UEL is based on Dirac notation and algebra for quantum mechanics, so it is natural for us to consider the uncertainties inherent in observations and measurements and the probabilistic inference from them that characterize that discipline. A physician comparing data for a patient with that from a population is not in a fundamentally different position from a physicist comparing experiment and theory. For example, diagnosis should ideally be based on comparing the joint distributions of data dispersions for the single patient with those for the same kinds of measures from populations, in order to compute the probability that the patient is, or is not, in a normal state of health. The same idea is seen (albeit usually using rather extreme simplifying assumptions of normal distribution and independency) in the routine use of the “normal range” of each clinical value, but properly it should consider dependence upon other measurements and denominators such as sex, location, ethnicity, genetics, personal and family history, comorbidities present, and lifestyle. Hence Q-UEL also allows for representations of more complicated probability distributions as vectors and matrices [1]. That for inference purposes Q-UEL interprets these as algebraic values of tags treated as Dirac algebraic entities may look controversial outside of physics, but that such information should be conveyed in some kind of way is essentially classical statistics. The real controversy is about how much extra detail of this probabilistic nature any kind of UEL effort should compute and put on its tags or other artifacts for specific patients before it appears to be imposing its worldview.

Footnote 3. They will be reviewed elsewhere, but they include, for example, BlueButton, Validic, FitBit, Jawbone, Moves, and Withings, the Patient-Powered Research Networks, and the Self-Generated Health Information Exchange (SGHIx).

Our preferred methods for computing probabilistic quantities must stand alongside many diverse offerings in CDSS that have traditionally been innovative and controversial in theory and method from the outset [16], and those few SW efforts that are probabilistic can have different kinds of probabilistic measure as input and output, e.g. Refs. [27-29]. Not surprisingly, medical record standards bodies consider probabilistic interpretations as out of scope and matters for developers of end-user analytical applications, who agree. Any further debate on this may even be premature. All players await widespread automated feeds from laboratory information management systems to provide greater mention of data quality in source clinical documents. Q-UEL has a mechanism for deferring probabilistic aspects (Section 2.2).

1.7. Conversion to Q-UEL: Use of Ontology and Nomenclature Standards.

The focus in this report is on the appearance of Q-UEL tags for COC and on their disaggregation, but comparison of Q-UEL with major standards efforts in the following Sections is most usefully done in the light of studies of how readily it interconverts with them. While the sense in the field is that standards interconversion requires sophisticated semantic methods, it seems much easier when a UEL is used and used in the manner that PCAST described (Section 1.3). One can use brute force and write specific converters, one per standard. For Q-UEL, it involves “hand crafting” tags within Q-UEL's Perl-based applications [1] which are then generalized to become match-and-edit instructions (“regular expressions”) in converters. Most effort goes not into expressing and using the important clinical information but into preserving information about the complex semantic and ontological structure of the source in order to convert records back into that source (although that is not in principle essential for a UEL; Section 1.3).
It is desirable to think of any UEL as going beyond a simple “drop box” for content from different standards. Q-UEL already liked to describe data in many nomenclatures, e.g. representing molecular formulae of drugs to interact with chemoinformatics applications, and in an ontological graph structured way [1]. This used Q-UEL's XML-like attributes extended by attribute metadata language (AML). It relates to PCAST's “structured as individual data elements, together with metadata that provide an annotation” (Section 2.1). It also relates to the SW, because Q-UEL tag structure as semantic triple [15] expressions with AML-based attributes as arguments allows tags to carry the relevant cross-language dictionary and “grammar” with RDF references where necessary. By AML, Q-UEL's simple methods for specifying sources and nomenclature are essentially the same: data elements can be like current children in a family tree that threads back by line of descent, specifying the nomenclature used and then the source that used it, highlighted by a CODE attribute. For example, we might see the string SOURCE:=CODE:='HL7 C-CDA(R1.1 CCD)':=…CODE:='SNOMED (RF2 ICD 10)':=.... where the ellipsis '…' implies other metadata, brackets (…) for graph structure, RDF links, and/or values [1]. The current use of AML in COC is described in Section 3.2.

1.8. Comparison of Q-UEL with Continuity of Care Efforts: HL7 CDA/CCD.

There are currently several large-scale COC interoperability efforts such as the Standards and Interoperability (S&I) open government initiative [24] noted above, which we monitor and in which we in some cases participate. The S&I efforts primarily focus on HL7's use of XML as Clinical Document Architecture (CDA) [54, 55]. It replaced the older HL7 V2 messaging that was not XML-based, but which is still widely used because updating installations to CDA is not trivial [56]. Of particular interest to us has been the eHealth effort [25].
It is technically part of S&I, but tackles the integration of European and US healthcare that we presumed would be a challenging case for COC. The challenge for HL7 interoperability in general is that CDA versions have shown considerable evolution [57-59] and each implementation in an institution can differ from others, because COC outside its boundaries has previously been seen as less of an issue. Matters are harder still between countries. HL7's CCD [59] seen as a messaging artifact is a restricted subset of CDA that is very directed to US healthcare practices and processes. This made it not so trivial for eHealth [25] to convert or have interoperability between CDA in other countries such as in the European epSOS initiative [60]. Both use the HL7 CDA R2.0 basis but it is specifically a matter between HL7 C-CDA R1.1 CCD and epSOS PS v1.4. Faced with restrictive deadlines, the eHealth goal is, at this moment of writing, essentially a whitepaper demonstrating what appears feasible. Q-UEL has focused on epSOS CDA [60] because it might reveal new perspectives by not being US-oriented. Indeed, epSOS documents are fairly rich in specifying stakeholders including those managing the documents, all of which can provide information for provenance, workflow management, and triggering events in COC (see Sections 1.9, 4.2).

1.9. Comparison of Q-UEL with Continuity of Care Efforts: VistA.

The US Department of Veterans Affairs VistA system [61] is widely used in the US and at the time of writing is being considered for deployment in other countries, notably the United Kingdom. VistA content is attractive for considering workflow because it tends to provide very detailed accounts of “triggered” events involving, for example, prescriptions (Section 4.3). VistA is also well known to vary in implementations and there has similarly been a significant rise in VistA activity since PCAST.
VistA is not based on XML but on the old but ingenious MUMPS or “M” programming language, originally developed for medical applications in the 1960s [62]. Its features encourage a programming style where expressions resulting in database entries can be little ontologies very like attributes using Q-UEL's AML (Section 1.7, and Ref. [1]). One can copy MUMPS source code and automatically edit it by, say, a Perl script to surround data written to a database with Q-UEL tag features, so writing Q-UEL tags directly to a database. Some VistA implementation features obscure the Q-UEL affinity so that it is often as easy to intercept input and output, but its structure is that which the MUMPS code implies. The overall effect is a rather granular one, emphasized in the following.

1.10. Comparison of Q-UEL with Continuity of Care Efforts: EAV Models.

Granularity is a notion highlighted by design in the Entity-Attribute-Value or Event-Attribute-Value (EAV) model [63, 64]. EAV is not a specific standard or language, but it has long been used as a concept in an on-going effort by many data miners to convert XML documents and, in particular, HL7 CDA, into a more accessible interoperable form, at least as far as data mining is concerned [1]. In effect, the EAV approach starts with unstructured data mining including use of text-analytic and SW tools to convert source into a more granular structured form. What is usually meant by EAV is a table with data recorded as just three specific columns: the entity or event such as the patient, the attribute or parameter such as “pulse” usually linked to a table of attribute definitions and other information, and the value of the attribute, such as the pulse rate as a number. Q-UEL's AML could in principle similarly have 'patient X':=pulse(beats/minute):=80 or 'patient X':=pulse(beats/minute):=www.qexl.org/pulse:=80.
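
The three-column EAV table and the AML-style strings quoted above can be related by a short sketch. This is only an illustration under stated assumptions: the names EavRow, to_aml and group_by_entity are hypothetical and are not part of Q-UEL or its applications, the straight single quotes around the entity are our rendering of the quoting in the text, and only the ':=' chaining shown in the two example strings is reproduced.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class EavRow:
    """One row of a three-column EAV table (hypothetical helper type)."""
    entity: str                             # e.g. the patient
    attribute: str                          # e.g. "pulse(beats/minute)"
    value: str                              # e.g. "80"
    definition_link: Optional[str] = None   # optional link to an attribute definition


def to_aml(row: EavRow) -> str:
    """Render a row as an AML-like string, chaining the parts with ':='."""
    parts = [f"'{row.entity}'", row.attribute]
    if row.definition_link is not None:
        parts.append(row.definition_link)
    parts.append(row.value)
    return ":=".join(parts)


def group_by_entity(rows: Iterable[EavRow]) -> Dict[str, Dict[str, str]]:
    """Collect attribute/value pairs per entity; the flat result shows that
    nothing of the source document's nested structure is kept."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for row in rows:
        grouped[row.entity][row.attribute] = row.value
    return dict(grouped)


rows = [
    EavRow("patient X", "pulse(beats/minute)", "80"),
    EavRow("patient X", "pulse(beats/minute)", "80", "www.qexl.org/pulse"),
]
for row in rows:
    print(to_aml(row))
```

The grouping function returns only a flat dictionary per entity, which is the sense in which a bare EAV description carries no ontological structure of its own.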
However, starting only with descriptions like that, one would obviously lose ontological structure as the organization of the data of the patient as a whole that other sources like XML provide. Compared with XML, EAV is generally criticized as not being relational. Q-UEL is a relational EAV model, with a relationship or predicate as in a semantic triple [15]. More usually, though, when Q-UEL uses or creates EAV structures, “relationships” means an XML-like ontological structure that AML can encode, as described in Section 3.2. The following also have EAV flavor.

1.11. Comparison of Q-UEL with Continuity of Care Efforts: HL7 Pipe-hat.

HL7's older V2 in many respects follows the EAV model as a messaging system, often called “pipe-hat” due to its pipe '|' and hat '^' delimiter characters. Pipe-hat could not be described as an ongoing effort because further development of it is deemphasized by HL7 in favor of the XML-based approach, and so our own efforts here are preliminary. However, it will become important for pipe-hat users to convert to another representation, and ironically, while pipe-hat conversion to XML is somewhat difficult [59], conversion to Q-UEL is relatively easy because of the granular EAV nature. A distinct string in a pipe-hat message describes, say, the reaction to codeine (an allergy or AL message string found in, e.g., HL7 V2.5.1 pipe-hat), which can become a Q-UEL attribute. Metadata are consistent but variations in expressing content are considerable.

1.12. Comparison of Q-UEL with Continuity of Care Efforts: JSON

Overall, we see JSON [22] as a newly arising challenge to Q-UEL although it has been developing since about 2001. JSON is currently mainly used to transmit data in general between a server and web applications. What is transmitted is not confined to medical data, but it already has the considerable advantage over Q-UEL of having a significant number of medical installations.
In principle, JSON poses little more competition to Q-UEL than does traditional XML. Where JSON argues that it is superior to XML, the arguments include those which Q-UEL uses to indicate its own superiority to XML, i.e. that XML is comparatively verbose, hard to read in practice, slower to process, inefficient at handling certain types of data, has a need to “escape” many characters that have special reserved functions, and lacks association with any family of programming languages. XML argues that compared with JSON it is no less easy to read if formatted correctly, and is extensible and flexible. It could be argued that Q-UEL satisfies both, primarily by building on XML towards a programming language form (and retaining miscibility with, but not absolute dependence on, Perl [1]). Nonetheless, JSON's features, such as attribute-value pairs, are already Q-UEL-like and give JSON some status as a UEL with EAV flavor and make conversion to Q-UEL relatively easy. Consistent use of JSON for the EHR is not yet established, so Q-UEL interconversion with JSON is preliminary. There are significant variations in implementation because new organizations often use JSON as their first entry into encoding patient data. In practice, JSON's recently rising popularity as a candidate for the EHR is probably largely because of low development and maintenance costs compared to existing XML-based solutions. It is also seen as compatible with NoSQL type databases which allow well-structured and efficient storage at typically less cost than relational database solutions. JSON is also exchangeable with web-based technology (like NodeJS on the server side and HTML/JavaScript), although as noted above, JSON itself arguably really only comes close to a web-based UEL in forms like JSON-LD [23]. This allows an application to start at one piece of linked data, and follow embedded links to others hosted on different sites across the Web.
However, Q-UEL is not technically behind here: for example, Q-UEL's X-tract tags gather their own data content [1] by automatically surfing the web to spawn new X-tract tags, each containing a canonical rephrasing of a chunk of the source text in a way that maps to semantic triples. JSON may also be popular simply because there are many programmers familiar with the Java-like family of languages. Like Q-UEL, JSON essentially represents a programming language, in JSON's case by being developed from JavaScript, but JSON as usually applied is simply concerned with emulating XML layout in JavaScript format. To our perception it is not a desirable format change, considering the useful similarity between XML tags and the Dirac notation that provides a system of probabilistic algebraic objects. JSON makes no such claims. We do not really have a UEL for healthcare if we actually use a different language like JSON for each different aspect of it, but even confining debate to the COC domain, Q-UEL was designed to support several features directly important to COC at a fundamental level. These include disaggregation, a powerful AML attribute structure, and RDF references. Also, arguably important in the future is Q-UEL's treatment of measurement uncertainty and its power as a probabilistic algebra (Section 1.6).

1.13. Features of the Present Report of Interest to Stakeholders in Healthcare.

The journal invited us to describe how the present work could have impact on the work of stakeholders such as physicians (Section 1.1), who could therefore also be interested readers. In general, overcoming the concerns expressed in the 2010 PCAST report would, of course, be beneficial, and stakeholders are referred to that report [12]. The following points may be particularly borne in mind.

a. Physician Authority.
One aspect of PCAST 2010 that is controversial for physicians is that disaggregation and increased granularity also appear to have been favored by PCAST [12] because each stakeholder need only see what he or she directly needed to know (Sections 2.4 and 5), in contravention of traditional physician authority and holistic principles of medicine. However, these matters are up to future public opinion and legislature; they are not an automatic consequence of disaggregation. The technology described below allows flexibility. It could put more control in the hands of patients, or of personal physicians, or both in collaboration, by means of a fine grained consent language written into the source record (Section 3.6).

b. Simplicity of Applying Disaggregation.

When a patient is travelling abroad and the latest EHR or medical summary is needed, it can be quickly obtained from the Internet, by those authorized and consented to see it. No special server configuration is required; a simple website would suffice (indeed, a patient or traveler in general could simply distribute disaggregated data as attachments across email providers and drop-boxes, as a method of keeping copies of important documents that might be lost while traveling). If the reaggregation application that reassembles the disaggregated data is not at hand to the physician, a basic one could be downloaded freely from the web. Armed with a reaggregation application, all that is required to obtain the medical data is, even in the rather elaborate setup described in this report, either some four passwords or keys, or a digital certificate and a “job number”. It could be simplified to a single password. If a legitimate medical worker in a foreign country lacks recognized authority, then that can be assigned as a kind of separate key.

c. The Patient Record as Query to the Web.
In the longer term, a UEL with such a semantic structure will be able to interact fluidly with a future medical Semantic Web particularly designed to help physicians and patients. It is particularly well seen as an aspect of COC when a single specific patient record is transmitted and reaggregated and plays a central role in clinical decision support: in effect it can be, in whole or part, the query: “What does this predict for the patient, what can be best done to enhance the probability of a favorable outcome and future, and what gaps in knowledge about the patient need to be filled?”

d. Medical Notation and Educational Value of Q-UEL.

Perhaps least obviously, Q-UEL also aspires to be a medical notation, so there is perhaps more interest to be found by a physician or medical student directly looking at (unencrypted or decrypted) Q-UEL than is typically the case with other computer languages. Originally, it was felt that the ability of medical personnel to read directly and understand tags or similar transmitted artifacts might be a lifesaver when the infrastructure fails or becomes potentially overloaded, as in New York on 9/11, but when texting, faxing, or couriers are still possible [1]. Q-UEL stripped of web management details (which are nonetheless not too distracting) seems to provide an intuitive notation despite its origins in theoretical physics. It is capable of describing patient data, evidence based medicine (EBM) and epidemiological measures, medical knowledge, and healthcare workflows. Aspects of it are currently being taught to medical students in an EBM course by author BR. It remains that the larger part of the present report will be of more interest to systems designers, algorithmists, and developers in the HIE domain. To aid such persons, an established mathematical paradigm, such as the Dirac notation in the present case, is always a useful guideline as follows.

2. Theory

2.1. Q-UEL Medical Notation.
Many of the considerations specifically required for COC might not be expected to be very theoretical, but the tag structure and algebraic interpretation of Q-UEL based on Dirac’s QM do make it well suited to the PCAST view of COC. As for Q-UEL in general, the Dirac notation determines the format of what is yet to be disaggregated, what disaggregated, encrypted and consented tags look like, and summary information from data mining all these. The commonest Q-UEL tags express Semantic Web semantic triples in the form of Dirac’s bra-operator-ket. They are formally probability duals (Pfwd, Pbwd) [1,8].

< subject expression Pfwd:=x | relationship expression | object expression Pbwd:=y >

Tag value attributes Pfwd and Pbwd represent the tag’s algebraic value, the h-complex probability dual (Pfwd, Pbwd) [1, 5]. For example, Pfwd = P(“Type 2 diabetes causes obesity”), but Pbwd = P(“Obesity causes type 2 diabetes”). Pfwd and Pbwd are not of further concern here for reasons in Section 1.6, and the implied dual is (1,1) = 1. However, Pbwd:=0 will be seen to indicate irreversibility, as in workflow. The default value for any quantity not mentioned, including a probability, is 1. Q-UEL’s deferment “mechanism” (Section 1.6) is that probability 1 does not necessarily indicate “100% true” but can mean ignorance about probability or deliberate choice not to convey it, lack of communicated information I = -log(P) = -log(1) = 0, a statement posited awaiting any possible refutation according to Popper’s principle, and lack of quantitative impact in a purely multiplicative inference net [1,8], as if it were absent. Although explicit probabilities Pfwd and Pbwd are deemphasized in this report, the same controversial probabilities for single patients (Section 1.6) appear in the theory for disaggregation. Without disaggregation, a particular combination of attribute values can have a very small joint probability, and that could uniquely pinpoint one record out of millions.
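The tag layout and the role of the default probability 1 can be illustrated with a minimal sketch. The function names, the independence assumption used for the joint probability, and the numerical values are ours, chosen only for illustration; they are not part of the Q-UEL specification.

```python
import math

def make_tag(subject, relationship, obj, pfwd=1.0, pbwd=1.0):
    """Format a triple in the bra-operator-ket layout quoted above.
    Unmentioned probabilities default to 1, as in the text."""
    return f"< {subject} Pfwd:={pfwd:g} | {relationship} | {obj} Pbwd:={pbwd:g} >"

def information(p):
    """Communicated information I = -log(P), in nats; P = 1 conveys none."""
    return math.log(1.0 / p)

def expected_matches(attribute_probs, n_records):
    """Expected number of records sharing a combination of attribute values,
    ASSUMING (for illustration only) that the attributes are independent."""
    return n_records * math.prod(attribute_probs)

# Hypothetical workflow-style tag with Pbwd:=0 marking irreversibility.
tag = make_tag("patient arbitrary ID", "triggered", "clinical factor", pbwd=0)

# Four hypothetical attribute values with marginal probabilities
# 0.1, 0.05, 0.01 and 0.02 in a store of ten million records.
matches = expected_matches([0.1, 0.05, 0.01, 0.02], 10_000_000)
```

Under these made-up numbers only about ten records in ten million are expected to share the combination, which is the sense in which a small joint probability can pinpoint a record.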
Such a combination might perhaps ultimately identify, to a dedicated, unauthorized and malicious person, the one patient who has it. Conversely, with disaggregation, we must design matters so that if an unauthorized person were able to decrypt every shred of data from many patients in a mix, they still could not, even with substantial knowledge of a patient, see which pieces join up with any high probability. The probabilities of greatest interest now become those that relate to “bounds” on these, i.e., reflecting the fact that it is good to make illegal access more improbable than seems to be required, but not less so. Notwithstanding that, they must also represent a balance: the probabilities with which an authorized person can correctly reaggregate the attributes in reasonable time, versus the probabilities that a malicious, determined, and computationally well-equipped unauthorized agent might also do so. Note that between those two, any disaggregation solution inevitably presents a notion of error, of a certain probability that the tags joined do not belong to the same patient. Equivalently, with appropriate verification mechanisms, that will manifest less dangerously as the problem of persistently failing to find tags that do join. Evidently, disaggregation systems must be designed so that probabilities of undesirable effects are astronomically small. This is primarily a matter of matching between strings carried on disaggregated tags and strings generated by the reaggregation application. That is, the above kinds of probabilities are not written explicitly on tags in Pfwd and Pbwd attributes, but as seemingly random character strings called join strings as values of other special “tag value” attributes. They proactively determine, not passively report, the probabilities involved.
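How the length of a join string sets these probabilities can be shown with a rough calculation. The sketch below assumes, purely for illustration, that join strings behave like uniformly random strings over a fixed alphabet; the function names and the numbers are ours.

```python
def chance_match_probability(alphabet_size, length):
    """Probability that one random string equals a given join string,
    ASSUMING uniformly random characters (an idealization)."""
    return float(alphabet_size) ** -length

def expected_false_joins(n_tags, alphabet_size, length):
    """Expected number of tags in a mix of n_tags that match by chance."""
    return n_tags * chance_match_probability(alphabet_size, length)

# Toy numbers: a 10-character binary join string in a mix of 1024 tags
# gives, on average, one accidental match.
toy = expected_false_joins(1024, 2, 10)

# A longer string over a larger alphabet drives the figure down steeply.
safer = expected_false_joins(10**9, 62, 12)
```

The design point is the balance described above: longer join strings make false joins rarer, while shorter ones give an attacker less to work with, so verification has to carry more of the load.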
In effect, it is as if an unauthorized person has to guess a new password to rejoin each and every data element while also guessing which tags belong to the relevant patient, and the system can be designed so that these two are not the same thing.

2.2. Q-UEL Tags in Continuity of Care.

The tag examples most prevalent for COC applications, and that also use the default of probability 1, are as follows, with the exception of (4), which is a statistical summary over many patients, and (5), which displays Pbwd = 0 to indicate irreversibility. Types (6) and (7) carry the seemingly random join strings mentioned in Section 2.1 that enable their reaggregation, but perhaps surprisingly (8), which is the only tag that carries reversibly encrypted clinical data, does not (see discussion below).

(1) < patient ID and demographic data | has | stakeholders and clinical factors >
(2) < patient arbitrary ID | consented | clinical factor >
(3) < patient arbitrary ID | consented jointly | clinical factors >
(4) < clinical factors | when | clinical factors >
(5) < patient ID and/or demographic data and/or stakeholders and/or clinical factors | triggered | stakeholders and/or clinical factors > (or “should trigger” etc., see Section 3.5)
(6) < irreversibly encrypted mapping to tag type (7) and isomorphic mapping to other tag types (6) that come from same record | (7) | irreversibly encrypted mapping from tag type (6) and to next tag type (6) in sequence >< irreversibly encrypted mapping from encrypted data element(s) on tag (8) | (8) | reversibly encrypted clinical data element(s) and optionally authority >

The functions of these tags are as follows. Tag type (1) is essentially the medical record, or a sub-record as a selected part of it. Tag types (2) and (3) are only released if consented to be so, subject to fine-grained consent instructions in the source record. Simply by containing more than one clinical factor, type (3) poses more risks to privacy than type (2).
Usually having visible content, tag types (2) and (3) are freely available for data mining for research including quality control. The results that produces, as statistical summaries over many patients, are represented by tag type (4) described in Ref. [1]. A type (5) tag records a workflow event that can be used to update tag (1). Types (1) and (5) generate many encrypted tags of form (6), (7), and (8), and optionally many of the tags consented for research. Use of tags (6), (7), (8) is the case considered in this report. Three components are required to store and transmit a single data item d, so this is called a triple shred of conceptual form <…| x |…><…| x |…> where the x represents joining (footnote 4).

Footnote 4: Contrast the simplest possible disaggregation case of encrypted data elements d1, d2, d3 etc. carried in corresponding tags <…d1…> x <…d2…> x <…d3…>; this constitutes a single shred, meaning a single shred per data item. One can also have a double shred using only tag types (6) and (8) and of conceptual form <…| x |…>, and a multiple shred using tag types (6), (8) and multiple tag types (7), of final form <…| x |…><…| x |…><…| x |…><…| x |…>….<…| x |…>.

2.3. Disaggregation into Bra, Ketbra, and Ket Tags.

While placing patient data on the Internet seems risky (Section 1.13 (a)), the disaggregation approach aspires to turn this risk on its head, and to advantage. The real impact of disaggregation as a security feature is in the role of an addition to encryption when disaggregated data elements are shreds mixed with shreds from hundreds of thousands or millions of other patient records. The additional security feature this provides is really the challenge of finding several specific needles in a haystack of needles that all look like equal candidates. Dirac’s notation, and the physics it reflects, seem helpful by providing two particular principles as a conceptual framework for this challenge. First, Dirac notation carries an inherent notion of aggregation. Entities <A| and |B> are representations of physical states analogous to nouns or noun phrases in a natural language, but aggregated forms as products <A|B> (shorthand for <A| times |B>) and <A| R |B> also exist as single algebraic objects. They describe the relationship or transformation between the states, and are analogous to a simple sentence. Second, there is the notion of an entity called an index as describing what is essentially the same state irrespective of how we measure or name it, so that we might write say |3>, and the notion of operators increasing or decreasing an index, say of converting state |2> to state |3> or vice versa respectively. The aggregation <1|2><2|3> makes physical sense by the “chain rule” [8], an expression that estimates <1|3> with certain independency assumptions, but makes no sense seen as a vector or matrix product of <1|2> with <2|3> because each is scalar (though there is a relationship, making indices the more general method, as described below). In practice, the index must relate much less obviously to the join string (Section 2.1), including by increasing the index, and the way of doing this is explained below (Section 2.5). The reason why the above tags (6)-(8) had the forms of bra <…| (a row vector), of ketbra |…><…| (a matrix) and of ket |…> (a column vector) respectively is that in Dirac notation <A| R |B> is also an expression as a non-commutative (order dependent) product, reversal of which can be seen as an aspect of disaggregation or shredding, i.e., <A| R |B> → (<A|) (R) (|B>) with R representable by some product |X><Y|. It represents the disaggregation of each data element B from a record, A and R having linking functions.
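The algebra behind the bra, ketbra and ket forms can be checked with small real-valued vectors. This is only a numerical illustration of the Dirac products named above; the vectors are arbitrary, and complex conjugation is ignored because all entries are real.

```python
def inner(bra, ket):
    """<A|B>: a row vector times a column vector gives a single scalar."""
    return sum(a * b for a, b in zip(bra, ket))

def outer(ket, bra):
    """|X><Y|: a column vector times a row vector gives a matrix."""
    return [[x * y for y in bra] for x in ket]

def sandwich(bra, op, ket):
    """<A| R |B>: row vector, matrix, column vector, giving a scalar."""
    return sum(bra[i] * op[i][j] * ket[j]
               for i in range(len(bra)) for j in range(len(ket)))

A, B = [1, 2], [3, 4]      # arbitrary example states
X, Y = [0, 1], [1, 1]      # arbitrary states defining R = |X><Y|
R = outer(X, Y)

aggregated = sandwich(A, R, B)           # the single aggregated scalar
factored = inner(A, X) * inner(Y, B)     # the same value, as <A|X><Y|B>
```

The two results agree, which is the sense in which <A| R |B> with R = |X><Y| can be split into a bra, a ketbra and a ket and later recombined.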
Recall that <A|B> and <A| R |B>, although products of vectors and matrices, are not themselves vectors or matrices but simply scalars (usually, however, complex scalars, i.e. with an imaginary part). Because the scalar value contains less information than the vectors and matrices that each represent arrays of many such scalar-like elements, it is aggregation, not disaggregation, that is associated with information loss. This is contrary to what we would normally think of in regard to entropy increase in a document shredding process. One might imagine an algorithm in which the available vectors and matrices are generated so that only particular ones will aggregate to give an object with a particular required scalar value. In practice, this is complicated and inefficient, and when we first disaggregate a large <…|…|…> form into many smaller <…|…|…> it seems a less appropriate model. Indices become useful as more generally applicable. An index approach is also a general kind of symbolic representation of a vector-matrix approach, because in the above example <1|2><2|3> one could have considered it as meaning <1| R |3> where <1| and |3> are vectors and R = |2><2| is a matrix. Theoretically, what disaggregation is doing is adding a dimension of entropy protection on top of encryption. It is as if an encrypted patient record on paper is shredded by an office shredding machine into a trash hopper containing encrypted shreds from many other patient records, yet can be re-aggregated from the mix on demand by authorized persons with appropriate keys and/or digital certificate. The hopper mix in Q-UEL jargon is a “tag soup”, and a basic Q-UEL principle is that all Q-UEL tags, for any purpose, can reside in this soup, recoverable by queries. Fig. 1 exemplifies how this happens by joining just two clinical data elements that correspond to QM measurements 5 and 10 in the figure, the values being reversibly encrypted character strings.
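The entropy protection offered by the tag soup can be given a rough scale. Section 2.5 estimates that approximately n log2 N bits of information are needed to locate n data items for one patient among N items in the mix; the sketch below simply evaluates that expression. The function name and the example numbers are ours.

```python
import math

def entropy_protection_bits(n_items, n_in_mix):
    """Approximate information, in bits, needed to pick n_items out of a
    mix of n_in_mix, using the estimate n * log2(N) from Section 2.5."""
    return n_items * math.log2(n_in_mix)

# Hypothetical case: 10 data items for one patient in a soup of 1024 tags.
small_mix = entropy_protection_bits(10, 1024)

# The same 10 items in a soup of one million tags.
large_mix = entropy_protection_bits(10, 1_000_000)
```

The protection grows only logarithmically with the size of the mix, but linearly with the number of items that must be found, which is why it is presented as an addition to encryption rather than a replacement for it.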
Measurements 5 and 10 are the only two real observations or measurements. Joining these and the other “virtual” measurements as character strings makes use of the idea of features that are mutually degenerate after certain index operations based on evolving keys, i.e. closely but obscurely related (Section 2.5). The essential “marriages” of degenerate measurements are (2,3), (4,5), (2,7), (7,8) and (9,10), in practice in that order. (1,2), (3,4), (6,7) and (8,9) are permanently joined. Remarkably, reversibly encrypted real data need be all that each ket carries, by use of a further irreversible encryption of it to a string on the ketbra tag (type (7) in Section 2.2). Bra tags and also ketbra tags have rejoining functions to rebuild the source record. (4,5) and (9,10) link ketbras to kets, and (2,3) and (7,8) link bras to ketbras. It is natural to think of (1,6) as joining bra to bra, and this is so, but theoretically (2,7) would suffice. Join (1,6) has a verification function, but it becomes essential as the information contents of measurements 2, 3, 7 and 8 are reduced by taking small substrings of them to enhance security. The relationship between strings 1 and 6 is that they encode two graphs isomorphic to each other, i.e., they are really the same graph expressed differently. Note that for security the entities represented by disaggregated tags never actually physically rejoin in reaggregation: a tag initiates an automatic query for the next via a receiving application.

Fig. 1.

2.4. Disaggregation and Reaggregation by Data Elements.

The implications of this aspect of the 2010 PCAST report are perhaps less obvious than may first appear. One has, perhaps, a mental picture of clinical data written on pieces of a child’s construction set like Lego pieces, that can be plugged into each other and then unplugged. Strictly speaking, this is closer to granularity, which has many advantages but may or may not have something to do with disaggregation as a security feature.
In practice, the principles that Q-UEL uses are also applicable to shredding and unshredding documents arbitrarily, without worrying what each shred represents. That could work for COC, and it does not require meaningful granularity. That is fortunate when one comes to shred, for example, medical images. However, apart from other advantages of granularity, arbitrary cut-points in shredding that ignore the granular nature of the content can sometimes be a bad security idea. If the shreds are illicitly decrypted, they can then start to look like pieces of a solvable jigsaw puzzle. By way of simple example, disaggregated shreds of data such as ‘systolic BP (mmHg)’:=160 and ‘diastolic BP (mmHg)’:=85 that keep integrity of data representation are (if decrypted) actually much harder to see as belonging to the same patient than arbitrary shreds “‘systolic BP (mm” and “Hg)’:=160 diastolic B” and “P (mmHg)’:=85”. Much smaller shreds, say of one character, would get round that, but be inefficient! The PCAST report is often quoted as requesting “disaggregation by metadata”, which would translate as disaggregation by Q-UEL attributes, though the term actually used was disaggregation by data elements. Either way, the PCAST motivation for disaggregation coupled to granularity is apparently partly because the same granular structure can then be used to provide some authorities with subsets of all the data elements on a need-to-know basis. One need only, in that viewpoint, reaggregate what is required. This issue of partitioning of data on a “need-to-know” basis touches upon a controversial area (Section 1.13) but gives some impression of what a “data element” needs sometimes to be, i.e., in practice, a block of related knowledge. A ket data tag, say |B>, will often contain just one data element (though there may be an authority attribute and additional web management features).
Nonetheless, Q-UEL’s AML also allows this one attribute to contain several metadata and values, perhaps in a hierarchic, ontological structure that could map to at least a part of an XML document [1], as a “bundle” of data elements. Attributes to be disaggregated from each other in Q-UEL documents are those separated by a logical and operator (or implicitly so, since and is the default). Note that immediately prior to disaggregation, non-identifying clinical attributes can be moved into the ket part. Strictly speaking, a disaggregation as an equation |A, B, C,…> = |A>|B>|C>… implies a dimensional product sometimes attributed to Grassmann [11], not random association between A, B, C,... That implies the need for additional information, as follows.

2.5. Disaggregation Indices and Formal Irreversibility of Disaggregation.

Indices lie behind the essential mechanism used for reaggregation. A disaggregation-reaggregation model based on multiplying vectors and matrices was discarded above both as inefficient and as a QM analogy that is not always so well justified (Section 2.3). There was the more general notion of indices. In QM, indices are numbers that label energy states uniquely, rather than referring to them by descriptions in terms of measurements of physical values such as position or momentum. The term “mutually degenerate” in Fig. 1 is the term as used in information theory. The QM terminology would be that the measurements joined are the same state with same index as seen in a different measurement representation, after the index describing the energy level of one of them is raised by one. For example, simply using brakets, we might say that <1|2> joins to <3|4> because we raise the |2> state to give |3>, so allowing the aggregation <1|3><3|4>. In practice, we obscure such obvious relationships in several ways, resulting in a method that is, admittedly, complicated overall.
These ways are primarily (a) by working with functions of the indices, say f(3), as the “measurements”, (b) never actually allowing tags to join up, and (c) never actually having the function of a raised index represented on a tag but issued as a query for a partial match via the reaggregating application. The “measurements” of Fig. 1 are in practice the join strings as functions of indices mapping to strings called evolving keys that represent the queries via the application. At each stage of evolution these keys partially match (see below) the “measurements” on the tags, say key k(i) matches to f(i) where i is, or relates to, the index. Q-UEL developer jargon is often rather lax in distinguishing entwined terms such as “indices”, “measurements”, “join strings” and “keys”. “Disaggregation indices” relates to both evolving keys as queries and the “measurements” matched. As noted above, the matches between join strings and keys are partial. It is important that the evolving key strings are usually considerably longer than the join strings on the tags. In other words, as an added security feature, we depart from reversibility in QM, and the string to match on any tag is a small substring of the evolving key (though it is not so for measurements 4, 9 of Fig. 1 in this report). Counter to first intuition, the algorithm for disaggregation can be made deliberately irreversible by several such means. What actually happens in reaggregation is that we simulate disaggregation, as far as indices are concerned. The disaggregator requires no information to know what the next tag is, since it is “fed it”, but the reaggregator has to be continuously fed sufficient information by queries to overcome entropy protection and recover the tags from the tag soup. On top of encryption, entropy protection requires approximately n log2(N) bits of information to locate n data items for one patient from N for many patients.

3. Methods

3.1. The Experimental System in Overview.
The focus in Methods is partly on what unencrypted Q-UEL tags for COC look like, but even more so on disaggregation and reaggregation, since in Q-UEL systems the EHRs etc. really spend most of their lives, stored as well as transmitted, in the encrypted and disaggregated form. The Q-UEL system helps to explore many possible working configurations. We here describe a configuration for which reaggregation is not very efficient because it uses, and hence illustrates, all the security features described in the present report, namely a full “triple shred” into bra, ketbra and ket for each data element and several systems of evolving keys. We used the QuantalCloud system configuration in Fig. 1 of Ref. [1] as the test-bed. Fig. 2 below fills in that picture as required for the present report. The system is really a flexible research toolkit that can accomplish tasks in various different ways. For the kind of use described in this report, an EHR or part of it, say tag type (1) in the classification of Section 2.2, a medical exchange artifact say of tag type (5), an AML file, or any arbitrary document or image can be disaggregated by an application such as QuantalSHRED. It is disaggregated into tags of type (6), (7), and (8), but also tags (2) and (3) of visible content can be consented for research purposes and released by the same application. Note that only if medical data is in Q-UEL format or converted to it will shredding normally be by the data elements belonging to attributes. Otherwise, shredding will still occur but is arbitrary, or more correctly stated, the shredding is by attributes with arbitrary metadata names and values that are small chunks of the document or image.
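The general idea of shredding into a mix and recovering by evolving keys (Section 2.5) can be conveyed by a deliberately simplified toy. It is not the QuantalSHRED algorithm: it uses one tag per data element instead of the triple shred, a single seed and evolver instead of four keys, SHA-256 as a stand-in for the irreversible encryption, no isomorphism test, and data left unencrypted for readability. All names and values are hypothetical.

```python
import hashlib
import random

JOIN_LEN = 12  # join string length in hex characters (arbitrary choice)

def evolve(key, evolver):
    """One irreversible evolution step; SHA-256 is only a stand-in."""
    return hashlib.sha256((evolver + key).encode("utf-8")).hexdigest()

def shred(elements, seed, evolver):
    """Emit one tag per element, carrying a short substring of the
    evolving key as its join string."""
    tags, key = [], seed
    for element in elements:
        key = evolve(key, evolver)
        tags.append({"join": key[:JOIN_LEN], "data": element})
    return tags

def reaggregate(soup, seed, evolver, count):
    """Re-run the key evolution and query the soup for partial matches.
    Tags are looked up, never physically joined."""
    by_join = {tag["join"]: tag["data"] for tag in soup}
    record, key = [], seed
    for _ in range(count):
        key = evolve(key, evolver)
        record.append(by_join[key[:JOIN_LEN]])
    return record

patient_a = ["systolic BP (mmHg):=160", "diastolic BP (mmHg):=85"]
patient_b = ["blood glucose (mg/dL):=160"]
soup = shred(patient_a, "seed-A", "evolver-A") + shred(patient_b, "seed-B", "evolver-B")
random.Random(0).shuffle(soup)  # tags from different patients are mixed

recovered = reaggregate(soup, "seed-A", "evolver-A", len(patient_a))
```

The reaggregator simulates disaggregation as far as the keys are concerned, which mirrors the description in Section 2.5; without the right seed and evolver, the queries simply fail to find tags.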
The disaggregating application will also release a digital certificate, job ID number, substitution keys that can override entries on the digital certificate, and an authority key, all of which can be held by and/or received by an authorized person using QuantalSHRED in order to re-aggregate tag types (6), (7), (8). The arrow pointing back from the digital certificate to QuantalSHRED is because QuantalSHRED could in turn shred that, generating yet again tags of type (6), (7) and (8) and the associated information required, including another digital certificate. For the present studies, we used real patient data like that of Ref. [2] and a collection of example medical records in various standards, as well as records converted to AML file format and some data from public health studies as described below (Section 3.2). Data emulated as consented were released (deidentified, and still confidentially within our “laboratories”) at the same time as “plaintext” data, i.e. not encrypted. Reaggregation requires a system of keys or passwords, most of which are normally (but not necessarily) carried on the digital certificate. When transmitted by other means they are substitution keys which the receiver can use to overwrite decoys in the digital certificate.

3.2. Interconversion of Health Records with Q-UEL.

A great deal of work for the present report was done using EHRs and medical data exchange artifacts written in Q-UEL or a corresponding tabular AML format, not least because these were specifically designed to allow disaggregation in the PCAST sense, while other representations were not. However, in order to get real patient data in the absence of any true medical installations of Q-UEL at the time of writing, converting source medical data from other standards to Q-UEL is a necessary first step.
A caveat on that is that we have substantial deidentified data directly obtained from some 2,000 volunteers in a public health study that “historically” represents the first instances of patient data going straight into Q-UEL format, as will be described elsewhere. Even so, we wish to tackle the standards interconversion problem.

Fig. 2.

Attribute metadata language AML plays a persistent major role in interconversion (Section 1.7). In designing any Q-UEL tag layout, truly scientifically meaningful ontologies of data that cannot be expressed at the same rank level are captured in attributes using AML. Data that are similar but do not exactly describe the same thing, alternative representations from different sources of the same data elements, or data obtained at a different time as indicated by timestamp, can all appear within one attribute, as in the simple example metadata1:=(metadata2:=value2, metadata3:=value3). The Q-UEL general specification does not insist that, say, an EHR be put on one tag with many attributes, but it is an interesting finding that it can be done, because the overall ontological structure of a source document often reflects rather arbitrary administrative matters in the philosophy of a standard. However, in order to facilitate converting Q-UEL back to source ontology, one may prepend to each value its line of descent, a single path, metadata by metadata, from root node to the leaf node value being considered: metadata1:=metadata2:=metadata3:= … :=value. We approach conversion of an EHR standard by first writing an AMLF or AML file. The columns contain the values, and there is one row per patient. The metadata is the first row of the file (column headings), and that will be, especially for XML source documents, a fairly long string describing the line of descent applicable to all values below it in the column.
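The line-of-descent idea and the tabular layout just described can be sketched as follows. This is an illustration under our own assumptions, not the authors' converter: the source record is modelled as a nested Python dictionary, the function names are hypothetical, and the sample values are invented.

```python
def lines_of_descent(node, path=()):
    """Yield (line of descent, value) for every leaf, joining metadata
    names from root to leaf with the ':=' metadata operator."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from lines_of_descent(child, path + (name,))
    else:
        yield ":=".join(path), node

def to_amlf(records):
    """Build a header row of lines of descent and one row per patient."""
    header = []
    for record in records:
        for column, _ in lines_of_descent(record):
            if column not in header:
                header.append(column)
    rows = []
    for record in records:
        values = dict(lines_of_descent(record))
        rows.append([values.get(column, "") for column in header])
    return header, rows

# Two invented patient records with a shared hierarchical structure.
records = [
    {"vitals": {"systolic blood pressure (mmHg)": 140,
                "diastolic blood pressure (mmHg)": 85}},
    {"vitals": {"systolic blood pressure (mmHg)": 122}},
]
header, rows = to_amlf(records)
```

Each column heading carries the whole path from root to leaf, so the hierarchy of the source document can in principle be rebuilt from the flat table.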
This AML file is used to write the Q-UEL tags, but to avoid visual clutter on those tags, a format such as metadata:=www.qexl.org/AMLF/QXML5/Header6/:=‘systolic blood pressure (mmHg)’:=140 is used, where Header6 means the heading of the sixth column of the AMLF called Q5XML, with the last metadata name (directly preceding the value) reproduced. It is subject to the encryption rules in Section 3.5. Finally, for commonly used tag types, attributes are rationalized (combined and simplified) in various ways to satisfy Q-UEL human-readable style [1] (footnote 5).

3.3. Tag Status.

As indicated above, Q-UEL is being used directly in at least one health study, and Q-UEL tags also get interchanged in collaborations with other workers involved with real, but sometimes contrived, patient data. There is always a small risk that data on such tags, once out of our hands, could be directly or indirectly accidentally used for a real patient. There is also perhaps a larger risk that directly or indirectly such data is included in the data mining of what is supposed to be all real and reliable patient data. Conversely, we can also foresee cases in which, analogously to the findings of another interconversion study [25], real medical data is only partially converted but still considered usable for the patient. For all such reasons, Q-UEL tags carry annotation as to their status and whether they can be used for medical purposes. All Q-UEL tags, including design suggestions and simplified examples, are executable, and status is indicated as an appendage to the tag name (the tag name is seen as just a special kind of attribute). It shows provenance, including, say, whether HL7 CDA/CCD or VistA was the source (should there be multiple sources, the SOURCE metadata name is also associated with the data in attributes: see Section 1.7). There, “handcrafted” will indicate that the converter was at least in a state of transitioning from specification to actual use.
Exception comments, i.e., in {!...!} brackets elsewhere in the tags, clarify that or indicate deliberate omission of data, and often imply that Q-UEL is being created but that reversal is a problem. Q-UEL binding variables, each a $ character followed by one or more upper case letters, are used to substitute for confidential, irrelevant, or unknown information. Known information can be replaced.

Footnote 5: Notably, reserved words are implemented that can stand for the same idea in different source vocabularies, and for common RDF references. The reserved words can be functions of tag name and SOURCE (Section 1.7), and their meaning therefore resides in converter subroutines, not directly in RDF references. This distinction, however, is not absolute in Q-UEL’s design. Specification allows that these subroutines can be referenced by RDF links on tags, including as downloadable code [1], although this has yet to be fully implemented in the COC context. The mechanism is similar to that by which <A | R |B> is seen as a dyadic function and R invokes executable code [1]. One purpose of such execution is to redisplay the tag with implicit meanings revealed.

3.4. Workflow and Job ID Number.

In COC, the correct sequence of events, and proper completion of it, is typically important. Fulfilling a prescription is an example given below. In any PCAST UEL-style scenario, there are two particularly practicable possible approaches to this, as well as variants in-between. One can either put information onto tags to control flow, or have each step of tag use initiate the release of a further kind of key, a workflow “number” that has to exist and be used by a subsequent application in order that the subsequent step can be accomplished. As Fig. 2 indicates, Q-UEL uses the second method. It is potentially more flexible, more secure, and allows a process to be halted or withdrawn by destruction of the key.
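The second method, in which each step releases a workflow number that the next step must present, can be sketched as a small key registry. The class and method names are hypothetical, the size of the random number is an arbitrary choice, and the single-use behaviour is our assumption rather than a documented Q-UEL rule.

```python
import secrets

class WorkflowKeys:
    """Toy registry of live workflow numbers (illustrative only)."""

    def __init__(self):
        self._live = set()

    def release(self):
        """Release a new random workflow number for the next step."""
        number = secrets.randbelow(10**14)
        self._live.add(number)
        return number

    def destroy(self, number):
        """Halt or withdraw a process by destroying its key."""
        self._live.discard(number)

    def run_step(self, number, step):
        """Run a step only if its workflow number is live, then release
        the number that authorizes the following step."""
        if number not in self._live:
            raise PermissionError("workflow number missing or withdrawn")
        self._live.discard(number)  # assumed to be single use
        return step(), self.release()

keys = WorkflowKeys()
first = keys.release()
result, second = keys.run_step(first, lambda: "prescription dispensed")
keys.destroy(second)  # the workflow is withdrawn before the next step
```

Because control sits in possession of the number rather than in anything written on the tag, withdrawing a process needs no change to the tags themselves.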
The key-release method is easily extended to a more managed system, such as when a QuantalMASTER application [1] is responsible for release and management of keys. One can have some processes for which either one of two (or more) keys from different sources might suffice (OR logic in workflow), and some for which at least two keys are required (AND logic in workflow). When any source is used to generate Q-UEL, a job ID or “job ID number” is also released. It is a particular case of a more general workflow number in a Q-UEL system, required to be available to a task in order to trigger it or authorize its use. It is an integer typically of about 14 digits that is (usually randomly) generated by the preceding task. Used rather like a transient PIN, chance duplication in the system causes no conflicts. It does not usually appear on tags, since having to know it is the means of controlling use. A job ID is generated by disaggregating applications for encrypting and decrypting the keys on the certificate. The job ID is not required if the keys are not on the certificate but are all transmitted unencrypted by other means, but workflow control can be retained by using it as a key that irreversibly encrypts data on the ketbra join tag.

3.5. Transformation of Q-UEL Tags by Encryption Operations.

In a workflow, the tag released as output from a process may look quite unlike that which was input to that process. It is not practicable in general to have the detailed instructions of the transformation encoded in the workflow number, or to release a new kind of “instruction tag”. It is also not always prudent to have it fully in the hands of the receiving application, but rather to let the incoming tag have considerable say as to what transformation is intended. The following is an important aspect for managing this, intrinsic to AML. It is a notation in tag attributes that tells an encrypting application, or part of one, what to encrypt in that tag.
When a tag is constructed from source data, what constitutes metadata and what constitutes an encryptable attribute value is adjustable in tags optionally initially generated. If for any attribute on a tag there is no string := (the metadata operator), the usual default is that the whole attribute is encrypted to, for example, ‘zC7g9FqY22M7 (encrypted attribute)’. This does not apply to a reserved set of words that are considered as operators, not attributes. Otherwise, if := is present, only the rightmost terminal node or “leaf” items in the attribute expressed in AML are considered as values for the purpose of encryption. Such a “leaf” item is the string of characters to the right of the last metadata operator := in the branch. When replacing any := by a single equal sign =, that := rule is still followed, forcing some metadata to be seen as part of the value and become encrypted. So sometimes the uninformative attribute metadata:=‘encrypted material (encrypted data)’ may be seen. In Q-UEL, tag names are special cases of attributes and may contain := and/or = operators. Similar rules apply, but QUEL-X, where X is an alphanumeric string, is usually not encrypted.

3.6. Data Extension and Fine Grained Consent.

If an application were to report to a stakeholder that a patient has a blood glucose concentration of 160 mg/dL, it may be the most important piece of information about that measurement, but it is certainly not the only important one. The context is important, but even more intimately involved with that number is when it was measured, the reliability of the measurement that reflects the technique used, and (considering the fundamental importance that Q-UEL places on consent) whether the patient consented the information for research purposes etc. Q-UEL tries as much as possible to deduce and secure from sources such extra “dimensions” of data qualification that it considers important. A fuller format is metadata(units):=value+/-dispersion(time)(consent consent).
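A minimal sketch of how an attribute in this fuller format might be assembled, and of how the leaf rule of Section 3.5 would then pick out the part to encrypt, is given below. The function names and sample values are ours; the treatment of reserved operator words is omitted, and the layout follows the template above literally.

```python
def extended_attribute(metadata, units, value, dispersion, time, consent):
    """Assemble metadata(units):=value+/-dispersion(time)(consent)."""
    return f"{metadata}({units}):={value}+/-{dispersion}({time})({consent})"

def split_leaf(attribute):
    """Return (metadata part, leaf). The leaf is the text to the right of
    the last ':=' operator; with no ':=' the whole attribute is the leaf."""
    head, operator, leaf = attribute.rpartition(":=")
    if not operator:
        return "", attribute
    return head, leaf

attribute = extended_attribute(
    "systolic blood pressure", "mmHg", 142, "9CI", "2014",
    "consented nearest 5 and year",
)
metadata_part, leaf = split_leaf(attribute)

# Replacing a ':=' by '=' pulls more of the branch into the leaf, so that
# some metadata would be encrypted along with the value.
_, wider_leaf = split_leaf("metadata1:=metadata2=value")
```

Only the leaf would be handed to the encrypting application; the metadata part stays readable unless the = form is used to draw it into the value.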
For quantitative data the data value is formally an expected, most likely, or mean value, and data quality is usually an optional degree of dispersion, as in 'systolic blood pressure (mmHg)':= 142+/-9CI. The time stamp is usually in Unix/POSIX readable date format for GMT, as in '142+/-9CI(Sat Feb 8 23:56:44 2014 GMT)'. Since it could help link data ket tags that are illicitly decrypted, it can also be suppressed and presented in a separate time attribute in the same tag, or as a relative attribute time to the last such time attribute, e.g. 142+/-9CI(+4:20:00 2014 RAT). By the mechanisms used for consent, the level of detail of any data encrypted or exposed may be modified. Consent is recognized by (consented …) inserted after the data element in the source document. In disaggregated tags one could see, if decrypted, 'systolic blood pressure (mmHg)':='142 +/-9CI(Sat Feb 8 23:56:44 2014 GMT) (consented nearest 5 and year)'. If consented visibly it appears as 'systolic blood pressure (mmHg)(nearest 5)':=140 +/-9CI(2014), because “nearest 5” is considered to be of the nature of units and distinct from the dispersion in subsequent advanced statistical analysis. Note that GMT is the default. This is growing into a rather flexible consent language that can express otherwise, necessarily so because there are subtle pitfalls for both patient privacy and analytics.

3.7. Information Required for Reaggregation.

It is evident so far that a significant amount of separated information is required to reaggregate a disaggregated record or other clinical artifact. Again, the following is for the “triple shred” approach. By use of the above information, it provides significant flexibility as to how a system may be set up, and finer control of what the patient and physician would like to happen.
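One such control is the consent coarsening of Section 3.6, in which a value is re-expressed to the nearest 5 and its time stamp is reduced to the year. A minimal Python sketch follows, assuming hypothetical helper names (coarsen, consented_attribute) and Python's default rounding; it is an illustration of the idea, not the Q-UEL implementation.

```python
from datetime import datetime

def coarsen(value, timestamp, nearest=None, year_only=False):
    """Re-express a value and its time stamp at the consented resolution."""
    if nearest:
        # Python's round() sends exact halves to the even neighbour;
        # a production rule might prefer a different tie-break.
        value = nearest * round(value / nearest)
    if year_only:
        # GMT form used in the text, e.g. 'Sat Feb 8 23:56:44 2014 GMT'
        parsed = datetime.strptime(timestamp, "%a %b %d %H:%M:%S %Y GMT")
        timestamp = str(parsed.year)
    return value, timestamp

def consented_attribute(metadata, units, value, dispersion, timestamp,
                        nearest=None, year_only=False):
    """Build the visibly consented form of one attribute."""
    value, timestamp = coarsen(value, timestamp, nearest, year_only)
    note = f"(nearest {nearest})" if nearest else ""
    return f"'{metadata} ({units}){note}':={value} +/-{dispersion}CI({timestamp})"

visible = consented_attribute(
    "systolic blood pressure", "mmHg", 142, 9,
    "Sat Feb 8 23:56:44 2014 GMT", nearest=5, year_only=True,
)
print(visible)
```

The resolution note is attached to the metadata rather than to the dispersion, following the remark that “nearest 5” is treated as being of the nature of units.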
For example, a government data miner might be given keys allowing the ability to decrypt and data mine private data tags, but not necessarily to reaggregate them and tell which record they came from. Further refinements, such as the alert and backtrack mechanisms discussed briefly below, represent a separate “layer” and depend on release of additional kinds of tags. In general and most fundamentally, use of the patient's data is controlled by the following.

(1) The Keys and their Roles. Whether or not they are carried on a digital certificate, the following keys are required. PW1, or patient password key, represents an irreversible encryption of the password assigned to, or chosen by, the patient and, like the other keys on the digital certificate, it is further reversibly encrypted, to be decrypted by the job ID. It contrasts with the other keys on the certificate in that they are usually randomly assigned by the disaggregation software, though this can be overridden. PW1 forms the seed of an evolving sequence: each step generates from it a new key that is required to query down the next chunkID or bra tag, which in turn enables the data or ket tag to be incorporated into the reaggregating record. As a seed, PW1 requires PW2, the patient password evolver key, as a particular catalyst key to do that irreversible encryption. PW3 is the decrypting key, really two keys entangled: PW3a reversibly encrypts and decrypts the chunkID attribute value, a string isomorphic to the chunkID values on the other tags, and PW3b reversibly encrypts and decrypts the clinical data. PW4 is the starter tag. It is not required in variants of the method where substrings to match on tags are long enough to be unique. It can be envisaged as a dummy tag that queries for the first true tag in the re-aggregation process.

(2) The Authority Code Key.
Authority code can be a further attribute value slot on the digital certificate to assign authority for the first time or temporarily, but usually it is absent because it is held for a prolonged period by the person or persons it authorizes. This key is seen as weak security, because the same authorization can be assigned to a large number of stakeholders with the same role. However, it could be unique, and can be in part an IP address for the re-aggregator's machine. Visible or irreversibly encrypted authority codes can appear in tag attributes, but can also optionally be encrypted along with the encoding of the graph for the isomorphism test in the chunk ID attribute value, or not be present on the tag at all but still required like any other key.

(3) The Club. The club is a large group of patients to which a patient has been assigned, such as New York State Veteran, or an arbitrary club. Its primary function is to regulate the speed of reaggregation indirectly by restricting the number of tags to be queried. It is a visible or encrypted value of the club attribute. This club is often known to the authorized stakeholder but can also be passed externally. A patient travelling abroad can be assigned to a temporary club. A club name is an integer, or implies one. In the present report, assigning a patient to club n picks out an nth digit, starting a run of digits that mathematically determines a very large family of possible graphs for that club. One member of that graph family is randomly generated and assigned to a disaggregation task, not a patient.

(4) Encryption Level. This is the level of intensity of encryption applied to all the keys as they are applied successively in the disaggregation and re-aggregation process. There are currently four levels, of which level one is least secure and fastest; it is currently normally level 3 that is applied. The level is regarded as a fixed feature of an installation, though it can be reset.

3.8. Disaggregation and Reaggregation by Evolving Keys.

The following gives details of the evolving keys and how they work. The disaggregation and reaggregation routines comprise a toolkit, and any specific implementation is called the shred configuration. As noted above, the approach used in the present study is the most complicated and slowest configuration that has been explored, but it illustrates the broad range of optional features. Bra, ketbra, and ket tags are used (again, the so-called “triple shred”) with the full four separate systems of keys, corresponding to the four quadrants or “key cycles” Ai, Bi, B'i, and Ci shown in Fig. 3. The upper half of Fig. 3 purely involves reversible encryption (RE), while the lower half involves irreversible encryption (IE). The transformation steps are applied, i.e. “the cycle turns once”, every time a bra, ketbra and ket are conceptually joined. Recall that these components never physically meet up, but query each other via the receiving application. The order of application used here is Ai → B'i → Bi → Ci, repeated for each data ket tag and its associated bra and ketbra. In describing them, however, it is helpful to take them in order of increasing complexity, as follows.

(1) The use of cycle Ai at the lower left of Fig. 3 is relatively straightforward as the prototype principle of the disaggregation method. Here PW1, the encrypted patient's password, is the initial string, i.e. seed, acting as a “plaintext” message to which a chain of irreversible encryption steps is reiteratively applied, one further encryption of the message per cycle. It is the message string evolving from PW1 that queries for a match with a substring on tags. Such substrings arising from irreversible encryption are values of join attributes on tags, and they can be very short substrings, especially on the bra tag, such that illicit reaggregation is confounded by mismatches unless supported by cycle B'i.
In reaggregation, the join attributes with their short strings as values are the first “text” hunted by queries in each cycle of searching in an archive of Q-UEL tags. Cycle Ai conceptually joins bra and ketbra tags, since the bra tag effectively queries for the ket tag via the application. However, since the index used to match the string on the ketbra is then processed to match a substring on the next bra in sequence, it also conceptually joins bra to bra. A reaggregating application progressively transforms the message from PW1 and requires PW2 as key for the encryption steps, a key of that kind being dubbed a catalyst key. The above is already a viable algorithm if the catalyst key, here PW2, is unchanging. However, for all the cycles in Fig. 3, they are also described as “seeded by” catalyst keys because catalyst keys can themselves evolve. This makes unauthorized attempts at reaggregation a little more difficult while adding negligibly to the time taken for the process. In the present study, the further key required to do this was simply built from substrings of the “message” evolving from PW1 and the string derived from PW2 from the previous cycle. Fig. 3. (2) At lower right of Fig. 3, cycle Ci joins ketbra to ket via strings as data attribute values on these, though the value of such on the ketbra is irreversibly encrypted. The data ket tag need only contain reversibly encrypted data with its attribute. The match to the corresponding irreversibly encrypted string on the ketbra is a matter of irreversibly encrypting the reversibly encrypted data, essentially the method used to verify passwords behind secure websites without having the original password held on the server. It is essentially like (1) above but starts with a string which is the reversibly encrypted data on the data ket tag. The transformation of this string requires a catalyst key resembling PW2 but normally built from authority code, job ID, and club. 
This key evolves as does the catalyst key in cycle Ai. Note that at each cycle of querying down the next data ket tag, the string value of the reversibly encrypted data will, of course, change because the data is generally different, but also the catalyst key cycle Bi shown above it in fig. 3, is changing. (3) In the Bi cycle at the upper right of Fig. 3, the first string is the reversible encryption of the source “plaintext” clinical data on the ket data tag attribute, and the key for each irreversible encryption step in cycle Bi is part of PW3 which is really two keys merged. Reversible encryption uses somewhat novel methods, not Perl encryption routines. In “Chaotic + XOR” in Fig. 3, “XOR” actually means a bit (“binary unit”) shuffling algorithm that requires part of PW3 (PW3b) as a catalyst key, this shuffling being combined with an XOR algorithm proper. After each bit shuffle the exclusive disjunction operation using the bitwise logical XOR (“exclusive OR”) operator is applied between every bit of every character in the message evolving from the data on the ket tag and every bit of every character of a specified template key string. So 01 or 10 with 1 in one string corresponding to 0 in the other gives 1, but 00 and 11 give 0. Both bit shuffling and XOR algorithm are reversible, i.e. allow decryption of the data. “Chaotic” signifies the following. The template string is in the present study a compound of authority code, job ID, and club with an integer generated by the Chaotic procedure in that particular cycle. “Chaotic” means that a Chaotic process is emulated but using integers. It is not fundamentally different from generating a chain of pseudorandom numbers. These integers are used in the simulation of disaggregation that enables reaggregation: the template evolves, but what it was at any step can be computed. The “Chaotic + XOR” algorithm is also used in the following. 
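The reversibility of the “Chaotic + XOR” step in (3) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the linear congruential chain stands in for the integer “Chaotic” procedure, a seeded permutation stands in for the PW3b-keyed bit shuffle, and SHA-256 is used only to stretch the compound of authority code, job ID, club and chaotic integer into a template of the right length. None of it is the authors' routine, and it is not offered as secure encryption.

```python
import hashlib
import random

def chaotic_integers(seed, count):
    """Integer-only stand-in for the 'Chaotic' procedure (a 64-bit LCG chain)."""
    x, out = seed, []
    for _ in range(count):
        x = (6364136223846793005 * x + 1442695040888963407) % 2**64
        out.append(x)
    return out

def template(authority, job_id, club, chaotic_int, length):
    """Template key string compounded from the four components named in the text."""
    material = f"{authority}|{job_id}|{club}|{chaotic_int}".encode()
    stream, block = b"", 0
    while len(stream) < length:
        stream += hashlib.sha256(material + block.to_bytes(4, "big")).digest()
        block += 1
    return stream[:length]

def to_bits(data):
    return [(byte >> (7 - k)) & 1 for byte in data for k in range(8)]

def from_bits(bits):
    return bytes(
        sum(bit << (7 - k) for k, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )

def permutation(n_bits, shuffle_key):
    """Key-dependent bit order (hypothetical stand-in for the keyed shuffle)."""
    order = list(range(n_bits))
    random.Random(shuffle_key).shuffle(order)
    return order

def encrypt_step(data, shuffle_key, template_key):
    """Shuffle the bits, then XOR every bit with the template."""
    bits = to_bits(data)
    shuffled = from_bits([bits[j] for j in permutation(len(bits), shuffle_key)])
    return bytes(a ^ b for a, b in zip(shuffled, template_key))

def decrypt_step(cipher, shuffle_key, template_key):
    """Undo the XOR, then undo the shuffle."""
    shuffled = to_bits(bytes(a ^ b for a, b in zip(cipher, template_key)))
    bits = [0] * len(shuffled)
    for position, j in enumerate(permutation(len(shuffled), shuffle_key)):
        bits[j] = shuffled[position]
    return from_bits(bits)

plaintext = b"example clinical value"
templates = [template("AUTH", 12345678901234, 1, k, len(plaintext))
             for k in chaotic_integers(seed=2014, count=3)]
cipher = plaintext
for t in templates:                      # the template evolves at each cycle
    cipher = encrypt_step(cipher, "PW3b-example", t)
recovered = cipher
for t in reversed(templates):            # each past template can be recomputed
    recovered = decrypt_step(recovered, "PW3b-example", t)
print(recovered == plaintext)
```

The point of the sketch is only that the template evolves from cycle to cycle yet can be recomputed at any step by a holder of the same components, which is what allows decryption during reaggregation.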
(4) The B'i cycle at the upper left of Fig. 3 is essentially the same as for Bi (3), but in this case with a reversible encryption PW4 of the encoding of the graph as a seed, and part of PW3 (PW3a) used as the key to evolve it in the cycle. However, the “plaintext” data revealed by decryption is not (as is the case for Bi) clinical data of interest, but a dummy ID called the Chunk ID, the value of the Chunk ID attribute, which is different on every bra tag of the record or of the “chunk” of it selected for transmission. Cycle B'i can be described as helping join each bra tag to the next bra tag to form the “spine” of the record. More precisely, it provides verification that what is being reaggregated by the Ai cycle does indeed belong to the same record, or chunk of it, that was disaggregated; this can be essential to reaggregation if the join value on the bra tag is a very short string. Such verification is not dependent on the order of reassembly. Recall that the Chunk ID value is a string encoding a graph that is isomorphic to the graphs of the dummy IDs on the other bra tags for the source record, and the graph implied by the dummy ID can belong to a family of graphs determined by the club: see Section 3.7 (3). The bit shuffling algorithm in this case is supplemented with a procedure that shuffles parts of a graph, in such a way as to provide an interesting feature. Illicit attempts to decrypt the Chunk ID will, at various stages, reveal decoy solutions that look like valid “plaintext”, meaning here not natural language, but rather well-formed valid graphs for that club, even though they are not valid for the specific patient.

4. Results

4.1. A Simple Tag Output Example.

Several types of tag were written by the above system, the most important being types (1)-(8) as described in Section 2.2. As an aid to understanding tag notation it is helpful to start with one of the simplest. Personal medical data is usually stored and transmitted in disaggregated and encrypted form.
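How such disaggregated tags find each other again was described in Section 3.8. As a reminder of the prototype principle of cycle Ai, the following minimal Python sketch evolves a message from PW1 with an unchanging catalyst key PW2 and uses a short substring of each evolved message as the join value. It is heavily simplified and hypothetical: bra, ketbra and ket are collapsed into one tag, the data is left unencrypted, HMAC-SHA256 stands in for the irreversible encryption step, and the function and key names are invented for illustration.

```python
import hashlib
import hmac
import random

def evolve(message, catalyst):
    """One turn of the cycle: an irreversible step keyed by the catalyst key."""
    return hmac.new(catalyst, message, hashlib.sha256).digest()

def disaggregate(items, pw1, pw2, join_len=6):
    """Emit one tag per item, each carrying a short substring of the evolved message."""
    tags, message = [], pw1
    for item in items:
        message = evolve(message, pw2)
        tags.append({"join": message.hex()[:join_len], "data": item})
    return tags

def reaggregate(soup, pw1, pw2, join_len=6):
    """Replay the same chain and query the soup for each join value in turn."""
    index = {}
    for tag in soup:
        index.setdefault(tag["join"], []).append(tag["data"])
    record, message = [], pw1
    while True:
        message = evolve(message, pw2)
        candidates = index.get(message.hex()[:join_len])
        if not candidates:
            return record
        # Short join values can collide, so every candidate is kept; in the
        # full method the Chunk ID isomorphism test would pick the right one.
        record.append(candidates)

items = ["History of diabetes", "Total cholesterol", "BMI"]
soup = disaggregate(items, b"pw1-patient-a", b"pw2-patient-a")
soup += disaggregate(["Systolic blood pressure", "Fat(%)"],
                     b"pw1-patient-b", b"pw2-patient-b")
random.shuffle(soup)                      # the random mix of tags
record = reaggregate(soup, b"pw1-patient-a", b"pw2-patient-a")
print(record)
```

Nothing on a tag links it to its neighbours; only a holder of PW1 and PW2 can regenerate the sequence of join values, and order of the soup is irrelevant.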
Some common principles may be made visible by a simple unencrypted consented tag that appears where patients chose to consent certain data. <Q-UEL-CONSENTED-EXTRACT patient:=„7FNcZZ6c(random)' club:=1 | consented jointly | male age:=35(2012) 'BMI(nearest 10)':=20 (2012) „blood pressure‟:=(systolic :=125+/-10CI (2012), diastolic :=70+/-8CI:=(2012)) „Fat(%)(nearest 2)‟ :=10%(2012) Q-UEL-CONSENTED-EXTRACT> String patient:=„7FNcZZ6c(random)' is probably least obvious. It relates to an alert and track-back mechanism optionally consented by the patient, and requiring a special QUEL_ALERT tag as will be described elsewhere. Without that consent and tag, it carries no meaningful information save that the authorization came from a patient. Otherwise, the interpretation should be intuitive, as Q-UEL was designed to be readily readable by humans. Note that single quotes round a metadata or value string are not required unless there is embedded whitespace. That just the year was consented is obvious without further annotation, but for statistical purposes data miners should be advised as to the resolution to which data is re-expressed by the consent. 4.2. Patient Record or Summary Tag with Stakeholders. In contrast to the above example, an EHR or extensive patient summary can be represented by an extremely large tag, so only example content of such a tag, highlighting features that are more peculiar to Q-UEL, are shown below. It is also true that an EHR can be portrayed as list of smaller tags each like that in the above example, or as an XML-like hierarchic structure as still valid Q-UEL [1], but these are not favored formats. Though personal medical data is usually stored and transmitted disaggregated and encrypted, exceptions are allowed in developer applications. The term “stakeholder” (Section 1.1.) 
is used as attribute metadata, although the most common and important stakeholders, e.g., patient, physician, and pharmacist, are recognized nominal-categorical data that can stand without it. Stakeholder attributes also reveal the artifact's past and future workflow, and extend the idea from the human players involved to the software applications and data involved. epSOS HL7 CDA source documents [59] are conveniently rich in stakeholder information (except for the curious omission of the physician in the source document at time of writing). Although the example below is of patient summary type, ultimately the identifying and stakeholder information will be split away from the patient clinical summary (Sections 1.9, 4.2). The emphasis below is on the less familiar stakeholder content. Stakeholder information can be particularly problematic in conserving privacy: a legal guardian is more revealing than a glucose level. Generally, unless perhaps one is absolutely sure that an example of patient data is contrived, one should guard against risk to patient privacy and other rights, even of a developer. To compare the source and the Q-UEL rendering of its stakeholder content, one may (at time of writing) request authority to access the original source epSOS document at the S&I eHealth website [25], and should register first with the general S&I initiative [24]. For present reading, Q-UEL masks potentially risky data in such a way that the tags can still be used by the system, and the protections even reversed if authorized, by comments indicated by {!...!} and by replacing data with Q-UEL variables $... Key details below have been de-identified as $AAAAA, $BBBBB etc., Q-UEL variables capable of being reassigned actual or mythical values when authorized. If the default operator and is specified, in some applications it optionally triggers a layout that aids reading when there are many attributes, as follows.
<Q-UEL-EHR:=‟Patient Summary‟:=(meaning:=www.qexl.org/patient_summary_4/, source:=‟epSOS PS XML‟:= „http://www.google.com/url?q=https://drive.google.com/file/d/0ByAfdYPeAnMejNEeEFZNWhqRjg/edit%3Fusp%3Dsharing &usd=2&usg=ALhdy2_JxPqGnmlQPdD7FbBWi4K8EottYA/‟, „detected source title‟:= Slovenian:=code:=sl-SI:=‟Povzetek pacientovih osebnih podatkov‟, Referrer:=‟Standards and Interoperability Initiative‟:=( http://www.siframework.org, EU-US eHealth‟:= „Work Group Activities‟:= „http://wiki.siframework.org/Interoperability+of+EHR+Work+Group/), author:=‟Barry Robson(Jan 19 10:50:18 2014 GMT)‟ :=(http://www.qexl.org/Barry_Robson_1/, telephone:=(code:=US#):=1-345-945-1082, email:=robsonb@aol.com), comment:=English:=‟Example transcription of epSOS PS XML‟, „Q-UEL words‟:=English:=domain:=(EHR, demographic, stakeholder, LOINC, histories, complaints, diagnoses, prescriptions, procedures, chemistry), „content words‟:=Slovenian:=code:=sl-SI, warning:=nonuse:=(example, handcrafted, unencrypted)) patient:=name(„given then family‟):=„ $AAAAA‟:=‟http://www.qexl.org/SI_Patient_Reg$BBBBBB/‟ and address:=(‟physical address‟:=(country:=SI):=(city:=Ljublijana):=((street:=$CCCCC):=(residence:=$DDDDD), postcode:=1000), telephone:=(+$EEEEE, use:=MV), email:=$FFFFFF) and male and birthdate:= „$GGGGG GMT‟ and speaks:= Slovenian:=code:=sl-SI | has:=‟http://www.qexl.org/has_3/ | stakeholder:=person:=‟primary physician‟ := {! 
„source data not detected‟ !} and stakeholder:=person:=custodian:= („http://www.qexl.org/SI_ Custodian-Organization_Reg44444/‟, address:=(‟physical address‟:=(country:=SI):=(city:=$HHHHH):=((street:=‟$IIIII‟):=(residence:=10),(postcode:={!likely source error!}), telephone:=(+$JJJJJ, use:=MC)):=person:=name(„initial then family‟):=„$KKKKK ($LLLLL GMT)‟ :=(‟http://www.qexl.org/SI_Doccument-Author_Reg$MMMMM‟/ and stakeholder:=person:=„source document author‟:= organization:=‟ ZD Ljubljana ($NNNNN GMT)‟ :=(„http:// www.qexl.org/SI_Organization_Reg$OOOOOO/‟, address:=(‟physical address‟:=(country:=SI):=(city:=Ljublijana):=((street:=‟ Neka ulica v ljubljani‟):=(residence:= 3 $PPPPP), postcode:=1000), telephone:=(+$QQQQQQ, use:=WP), email:=$RRRRR‟}):=person:=name(„title then given then family‟):=„$SSSSS:=‟http://twww.qexl.org/SI_Doccument-Author_Reg$TTTTTTT‟/ and stakeholder:=person:=„source legal authenticator‟:= organization:=‟ ZD Ljubljana (March {! „source data not interpretable‟:= 2013033000107-sic !} 2013 GMT)‟ :=(„http://www.qexl.org/SI_Organization_Reg340008204600048/‟, address:=(‟physical address‟:=(country:=SI):=(city:=Ljublijana):=((street:=‟ Neka ulica v ljubljani‟):=(residence:= 30b), postcode:=1000), telephone:=(+386557925143, use:=WP), email:= {! „source data missing‟:= UNK-sic !}):=person:=name(„tile then given then family‟):=„Dr. Stefan Pregl‟ :=‟http://www.qexl.org/SI_LegalAuthenticator_Reg540008204600049‟/ and stakeholder:= organization:=‟scoping organization‟:=‟ National institute of public health, Republic of Slovenia‟:=(„http://www.qexl.org/SI_Organization_Reg340008204600048/‟, address:=(‟physical address‟:=(country:=SI):=(city:=Ljublijana):=((street:= Trubarjeva):=(residence:= 2), postcode:= {! 
„source data missing‟:= UNK-sic !}), telephone:=( +38612441597, use:=WP), email:=mailto:epsos@ivz-rs.si) and stakeholder:=data:=„source document‟:= „http://www.google.com/url?q=https://drive.google.com/file/d/0ByAfdYPeAnM-ejNEeEFZNWhqRjg/edit%3 Fusp%3Dsharing &usd=2&usg=ALhdy2_JxPqGnmlQPdD7FbBWi4K8EottYA/‟:=specifications:= code:=‟XML to QUEL converted‟:=(„xml version‟=1.0, encoding=UTF-8):=(‟ClinicalDocument‟:= moodCode=EVN, classCode=DOCCLIN, xsi:schemaLocation=urn:hl7-org:v3 CDA.xsd, xmlns=urn:hl7-org:v3, xmlns:epsos=urn:epsosorg:ep:medication, (xmlns:xsi:=http://www.w3.org/2001/XMLSchema-instance/):=(typeId extension:=POCD_HD000040, root:=$SSSSSS, „templateId root‟:=$TTTTT, „templateId root‟:=$UUUUU, „id extension‟:=($VVVVV, root:=$WWWW):=(„code displayName‟:=‟Patient Summary‟, codeSystemName=LOINC codeSystem=$XXXXX, code=$YYYYY)) and stakeholder:=data:=‟previous document to source document‟:=‟($ZZZZZ GMT)‟:= code:=XFRM:= „http://www.qexl.org/SI_Organization_Reg$AAAAB‟/ {! A great deal of clinical data here !} and tagtime:=„Oct 10 12:43:20 2002 GMT‟ Q-UEL-PATIENT-SUMMARY> 4.3. A Prescription Artifact. The following example is a detailed record of a prescription event for a (definitely contrived) patient in a MUMPS/VistA source code used as an example by the US Veteran‟s Association. The relator is trigerred. This gives the “richest” example: prior to the prescription being fulfilled, Q-UEL formalism allows just the bra part <…| to carry the triggering event, here as a prescription request, although in practice to do that it uses bra-relationship-ket forms with, for example, will trigger, should trigger, could trigger, and would trigger with distinct meanings as will be described elsewhere. 
<Q-UEL-PRESCRIPTION:=„order entry and results reporting‟:=(meaning:=www.qexl.org/prescription_3/, source:=‟VistA FMQL‟:= http://vista.caregraf.info/fmql/:= referrer :=‟Tom Munnecke‟:=www.osehra.org/users/tommunnecke/, author:=‟Barry Robson (Sep 21 10:01:18 2013 GMT)‟:= www.qexl.org/Barry_Robson_1/, comment:=‟example transcription of example VistA FMQL entry‟, warning:= (example, handcrafted, unencrypted):=comment:=‟Do not use as input. Hand-crafted for discussion, specification, example, research, development and test purposes only. May contain errors. This example contains RDF-style definitions and above tagname qualification features not in the original source.‟) patient:=„John Smith‟:=www.qexl.org/US_Patient_Reg189958822/ and provider:=(center:=„Outpatient Site FMQL Clinic‟:= www.qexl.org/US_MedCenter_Reg8411/, (physician, prescriber):=‟ James Kildare‟ :=www.qexl.org/US_MD _Reg74356/) and Rx:=(simvastatin:=code:=(NDC:=000006-0749-54, VA:=4010153) :=www.qexl.org/simvastatin/, tablets(number):=90, tablet(mg):=40, „prescriber instruction‟:= (literally:=„Take one tablet by mouth every evening‟, formally:=(tablets:=1 by:=mouth with:=water(presumed) „when (local patient time 24 hour clock)‟:=19.00+/-4) )) and Rx#:=„800018 (Mar 5 09:11:03 2002 local)‟ and fills:=(„earliest possible‟:= „Mar 5 09:11:03 2002 local „, „next possible‟:= „Apr 5 24:00:00 2002 local‟, „last possible‟:= „March 6 24:00:00 2003 local‟) and „patient status‟:=code:=SC:=(„not exempt from copayment‟, „days supply‟:=30 refills:=11, renewable):=www.qexl.org/Verify_Status_US_Patient_Reg189958822_ US_MD _Reg74356_Rx#:=800018/ and order:=initiated and „prescribing status‟:=expired and „GMT minus local time(hours)‟:=7 and zone:=constant | triggered:=www.qexl.org/ triggered_3/ | dispensing:=(ordered:=10, „unit price($)‟:=0.80, available, delivery:=‟window pickup‟) and times:=(login:=‟ Mar 5 13:50:17 2002 local‟, fill:= „Mar 5 13:51:02 2002 local‟, „last dispensed‟:= „Mar 5 14:13:17 2002 local)‟, „label:= 
„Mar 5 13:50:27 2002 local‟, release:= „Mar 5 13:50:42 2002 local‟)) and copies:=1 and counseling:=(given, understood) and (pharmacist, enterer, printer, counselor):=‟Nancy Devillers‟:=www.qexl.org/ US_Pharmacist_ Reg101740/, and order:=converted and „dispensing status‟:=expired and „refill status‟:=open and „GMT minus local time(hours)‟:=7 and zone:=constant and tagtime:=„Mar 5 20:50:43 2002 GMT‟ Pbwd:=0:=comment:=„Process is not reversible, and forward direction is certain as a matter of record (Pfwd:=1 is the default)‟ Q-UEL-PRESCRIPTION> 4.5. Example Output Typical Disaggregated and Transmitted Forms. We are now equipped to see what transmitted tags look like, and to give further details based on these. Note below the recently included tagtypecheck attribute as a double verification of interpretation, and the day attribute, which allows an optional clearance cycle for the tags without giving much away as to timestamp. Tags with Tue will be cleared from the Cloud next Monday (though the day can be back or forward dated), and so on, after backing up the last weeks cycle of tags on a secure archive. Previously, an optional day could, to same effect, be added to the tag name as, e.g., <Q-UELSHREAD1Tue. The first tag type of the “triple shred” method is Q-UEL-SHRED1, where the 1 relates to the particular process, not the first tag from a disaggregated record. <Q-UEL-SHREAD1 day:=Tue tagtypecheck:=pseudovector:=bra:=chunkID:=‟record spine‟ chunkID:='303e3e3e3031383e3e373e3e3c31313c3c32333c3c30383e323e3e313c363e3c2 d3 93c30323c30333e30313c35303c32303c32323632353e3c3e3c31303c3139303e3c313c3c3238 3e3e3c31373e3c303e3c30343532313e3e31363c34343e3e323e3e3c3c353c313c3e3c32323e3 c303e3c31323931333e37(encrypted chunkID)' club:=1 authority:='bkXodmTAB6ogDsWYIGoPt c (encrypted authority)' join:='jwIqry(encrypted join)' | The remarkably short string length in the join attribute is a (usually arbitrary) substring of the actual index as evolved key with which it is matched. 
It is efficient as the first test, but it is backed up by the isomorphism test. In contrast, the chunkID above is remarkably long and repetitive, with common symbols. With our current routines that also impedes hacking. When decrypted, it is currently formally a symbolic representation of a graph that is isomorphic to other graphs on chunkIDs for the same record, and it could be set to be a subgraph of such (a slower test). The set of valid isomorphic tree structures is a directed graph randomly derived from a fully connected graph assigned to a club, which at present is not secure when the club is a unique integer because that graph is defined by a run of digits of that the integer indicates. Hence the decryption and isomorphism applies to the scope of that club, and would not conflict with keys etc, assigned to other clubs. Recall that there are also false or decoy solutions for graph representations generated on route in disaggregation. The second type of tag required for the “triple shred” is the Q-UEL-SHRED2 tag. | tagtypecheck:=pseudomatrix:=ketbra:=operator:='join data' join:='5lpgWRM(encrypted join)' to chunkID in club:=1 Q-UEL-SHRED2><Q-UEL-SHRED2 day:=Tue metadata:='History of diabetes':=„ lnCLA FXgw 4.s3D NTSy7p2 2YPKyjpR/UlQ2fBt6fnLJ8Qq9BXHegd0gMrE5Am UlcZuYDU2hVIaOu1Ul8sw Gd jT18YKTgC7.3GPpUCi5JLgNhtxgRfpDVTbGVngBZlT/CM nhAsserL.1mPT2UWBmF87LWl 1QMnCuTw8xcAbBP7ToyzF7wd41jhp8WmOkChebkWpX LDkQhwmqfIxhjYMTOtH9pNexMO 9/5F5LVsFQ0IxKGyrKMrUHckjNBl0H3k (encrypted data)' | These tags do not necessarily carry secure personal health information that would identify the patient because the patient‟s ability to retrieve it is proof of ownership. The third tag type required for the “triple shred” is Q-UEL-SHRED3. The authority key is optional, but it can control access to authorized data miners who can analyze these data tags without having ability to reaggregate any one record. 
| tagtypecheck:=pseudovector:=ket:=data day:=Tue metadata:='Total cholesterol':= '21b41023b2110191a10981a1d999c1d181a10191898199414b237321032b2b139bc383ab2321 032b03ab09284374716274702465636279707475646024616471692137393831333935333334 383d3132323028267(encrypted data)' club:=1 authority:='YJt/4aDCOcEsSjxUBp4zhonCeQuo ScZxYFDLhS0vv7zEpgX0TZkLa/Q (encrypted authority)' Q-UEL-SHRED3>

4.6. Disaggregation and Reaggregation Performance.

Table 1 reports performance results for methods and settings stated in the above text as “usually” or “typically” applied, etc. These settings are very demanding on computation, as discussed in regard to Table 2 below. Table 1 is for the full “triple shred” into bra, ketbra, and ket with extensive validation checks, including the rate limiting graph isomorphism test relating to the Chunk ID. Column 2 shows the time taken for the overall disaggregation and reaggregation process when fully automated. Note that these times are dependent on the hardware specified in the last column. Disaggregation is slowest; recall that medical data is kept in the disaggregated state. Even for disaggregation, the slowest process is “administrative”, being the high degree of randomization of the order of the tags in the tag soup (a file in this case), which was done for every new record in order to provide fair benchmarks. Three types of computer were used: an HP Compaq 6200 Pro (64 bit OS) with i7-2600 8 x 3.40 GHz processors and 4 GB RAM (memory), a DELL Vostro 320 with E5300 2 x 2.6 GHz and 2 GB RAM, and a T340 Thinkpad (64 bit OS) with i3-2350M 2 x 2.30 GHz and 6 GB RAM. These correspond to H, D, and T in the last column, and that order also reflects the age of the models (with H, an old HP server, as oldest), not each manufacturer's best current machine. There were also problems in balancing the Perl computation across the 8 processors of the old HP server, so the ratings give a rather unfair impression of performance on that particular machine.
As expected, the more recent machines performed better, and further improvement can reasonably be expected as processor speed increases with each new generation. The rate of generation on the latest standard laptops, including the ThinkPad in Table 1, appears to scale at about 2 seconds per shred (i.e. block of data associated with a metadata item) per 100,000 tags in the mix. The receiving system can be set so that the data elements associated with each attribute are displayed as the record aggregates, so the time to obtain essential emergency information, which can be placed in the early attributes, is important. The time of appearance of the first attribute is shown in Column 3. The benchmarks were run for tags with a variety of amounts of data in the attributes; on average each disaggregated tag requires about 400 bytes of file space. High resolution DICOM medical images can take some 10-30 times longer to assemble than a typical text record, but that depends on the extent to which they are arbitrarily disaggregated, i.e. the bytes per “shred” and the number of shreds, as will be discussed elsewhere.

Table 1. Benchmarks of the Full Method Described in the Text.

Column (1): Number of tags to search in soup.
Column (2): Total time (seconds) for disaggregation, shuffling of tags in soup (time consuming), automated immediate search, and reaggregation of summary record of 27 metadata items.
Column (3): Manually requested aggregation time (seconds) to show values associated with first metadata block.
Column (4): Overall process: seconds per actionable (readable) metadata block displayed per 100,000 tags in same club.
Column (5): Manually requested reaggregation: seconds per actionable (readable) metadata block displayed per 100,000 tags in same club.
Column (6): Platform (hardware etc.).

(1)          (2)      (3)    (4)    (5)    (6)
81           2        <1     142    <1     H
81           3        <1     142    <1     T
81           3        <1     142    <1     D
402          3        <1     29     <1     T
671          3        <1     34     <1     D
697          2        <1     17     <1     H
1,064        4        <1     6      <1     D
4,589        16       <1     3      <1     D
26,113       3        <1     8      <1     T
119,847      45       <1     3      <1     D
129,862      232      <1     15     6      H
259,297      7,946    19     15     9      H
359,416      142      8      2      2      D
469,284      97       8      2      2      T
599,022      254      21     2      3      D
1,987,820    400      27     2      2      T
3,216,347    684      30     2      2      T

Table 2 compares the above method as “usually” applied with departures from it. The findings are necessarily preliminary because of the large number of combinations to explore in determining what affects what, and how the effects interact. In practice, use of the Internet, especially for emergency services, quickly becomes the rate limiting step if each attribute, arriving as a set of data elements, corresponds to the refreshing of a web page. What matters is the number of patients in a club that can be managed.

Table 2. Maximum Club Sizes (Numbers of Patients in Clubs) with Modifications that Approximately Maintain Current Reaggregation Rates, Based on Preliminary Studies.

Methods are those used on current hand-held or laptop devices using the Internet, to maintain a time-to-first data element of less than 30 seconds and subsequent displays of elements at less than three seconds each. Club sizes are in thousands of patients, expressed this way as practical Internet use can become rate limiting. Divide the club size by the average number of data elements per record to obtain the club size meeting the above requirements for whole records. n = number of attributes pooled per ket data tag, p = number of parallel searches.

Q-UEL Method A
Method: Method described in text and Table 1, using one clinical data item per metadata attribute.
Club size (thousands of patients): 30-50

Q-UEL Method B
Method: Pool data into clinical attributes as “shreds” with an average of N items per attribute. In the club size, n is still about 95% of N: encryption time increases, but the graph isomorphism test remains rate limiting.
Club size (thousands of patients): n x 30-50

Q-UEL Method C
Method: Bypass of the graph isomorphism test, compensated by longer strings on joint attributes, produces variously 4-10 times speed improvement. Separation of bra, ketbra and ket, and clubs for each, into separate archives (no great loss to security). Includes estimated effect based on comparison of current Q-UEL experimental encryption methods written in Perl and industrial encryption subroutine performance.
Club size (thousands of patients): n x 1000-1500

Q-UEL Method D
Method: Parallel querying of p archives. Clubs, bras, ketbras, kets, and first to Mth parts of record are also put on separate “parallel archives” with M circa 5. Evolving keys testable as isomorphic in a simpler way than the graph test, so that assembly order is rendered unimportant: metadata are put in the required order after reaggregation. Querying for families of isomorphisms, and then using evolving keys to query these for the specific record.
Club size (thousands of patients): Estimated p x n x 15,000 at least, assuming reasonable parallelization.

5. Discussion and Conclusions

Since records are stored and transmitted in the disaggregated state, reaggregation efficiency is important for all COC functionalities of Q-UEL, at least in the current setup. The elaborate method described above, with benchmarks as in Table 1, allowed a club of 50,000 patients to be queried to obtain a basic patient summary of some 100 data elements (clinical factors), bundled into an average of 5 per attribute (“one shred” of record), in 1 minute. User controllable simplifications of the method raise the club size to 1-2 million patients. However, reasonable assertions in Table 2 imply that the same record could be obtained in the same time, by plausible modifications, for a club the approximate size of the US. Note that these estimates are still for an elaborate triple shred bra-ketbra-ket model, rather than a double shred bra-ket or single shred braket model, and were obtained on a standard laptop not interacting with a server. But however achieved, all these assertions depend, of course, on scalability.
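The club-size expressions of Table 2 and the one-minute summary estimate can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function names `club_size` and `summary_seconds` are hypothetical, the base ranges are transcribed from Table 2, and the per-shred display time of 3 seconds is taken as the upper bound stated there rather than as a measured value.

```python
import math

# Club-size ranges in thousands of patients, transcribed from Table 2.
# None marks an open upper bound (Table 2 gives "at least" for method D).
BASE_THOUSANDS = {
    "A": (30, 50),
    "B": (30, 50),
    "C": (1000, 1500),
    "D": (15000, None),
}


def club_size(method, n=1, p=1):
    """Return (low, high) club size in patients for a Table 2 method.

    n: attributes pooled per ket data tag (used by methods B, C and D).
    p: number of parallel searches (used by method D only).
    """
    low, high = BASE_THOUSANDS[method]
    factor = 1
    if method in ("B", "C", "D"):
        factor *= n
    if method == "D":
        factor *= p

    def scale(value):
        return None if value is None else value * factor * 1000

    return scale(low), scale(high)


def summary_seconds(elements, per_shred, seconds_per_shred):
    """Time to display a summary, one shred (pooled attribute) at a time."""
    shreds = math.ceil(elements / per_shred)
    return shreds * seconds_per_shred


if __name__ == "__main__":
    print(club_size("A"))               # method of Table 1
    print(club_size("B", n=5))          # pooling 5 items per attribute
    print(club_size("D", n=5, p=4))     # illustrative n and p, not measured
    print(summary_seconds(100, 5, 3))   # 100 elements, 5 per shred, 3 s each
```

With the illustrative values n = 5 and p = 4 (assumptions, not benchmark figures), method D gives a lower bound of 300 million patients, and 100 data elements pooled 5 per shred at 3 seconds per shred take 60 seconds, in line with the one-minute figure quoted above.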
The current rate (serially querying) is scalable up to at least 3-4 million emulated patient tags (Table 1), and earlier similar configurations reached 7-9 million on 2009 generation ThinkPads, though requiring extensive additional machine memory (RAM). Querying, associating, graph theoretic analysis, and processing of somewhat similar computational order, applied to 6.7 million chemical structure patent records, had similar performance on 2006 generation ThinkPads [65]. It is not hard to think of plausible improvements, but the tempting one of using very high performance supercomputers as servers to reaggregate and transmit encrypted results to portals would lose the “entropy protection” in transmission. This brings to mind that even PCAST-style disaggregation is not perfectly secure if tags being transmitted to a reaggregating portal are somehow monitored, but that in turn suggests using the last feature of method D (Table 2). Here, an “isomorphism family” is transmitted as a kind of club, and the document of interest is reaggregated from that club at the portal.

Because disaggregation has been made highly automatic, demonstrations are rather unimpressive. In effect, one takes a document, places it in a black box, closes the lid, opens the lid and takes out the document, albeit with a key or keys. To this may be added that any document can be shredded by splitting it into arbitrary elements, including images and scans of documents, spreadsheets, PowerPoint files, and so on. Since one need not look in the black box, it may be asked why one disaggregates data elements by metadata (attributes in AML). After all, with arbitrary shredding the problem of interconverting between standards vanishes, at least as far as protection by disaggregation is concerned. The reason why it is good to “shred by metadata” is that it allows pieces of medical data to be dispensed to authorized stakeholders on a need-to-know basis, and allows data mining of the private encrypted data by authorized persons. These reflect interpretations of PCAST requirements.
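The “black box” behaviour described above, in which a record is shredded by metadata, mixed with other patients' tags, and recovered with a key, can be illustrated with a toy Python sketch. This is not the Q-UEL method: there are no bra, ketbra and ket tags, no graph isomorphism test, and values are left unencrypted. The functions `evolve`, `link_for`, `disaggregate` and `reaggregate`, the HMAC-based links, and the sample records are all hypothetical, chosen only to show tags that do not physically join but are found in a random mix through a chain of evolving keys.

```python
import hashlib
import hmac
import random


def evolve(key: bytes) -> bytes:
    """Derive the next key in the chain (a stand-in for an 'evolving key')."""
    return hashlib.sha256(key).digest()


def link_for(key: bytes) -> str:
    """Opaque identifier under which the tag for this chain step is stored."""
    return hmac.new(key, b"link", hashlib.sha256).hexdigest()


def disaggregate(record: dict, secret: bytes) -> list:
    """Split a record into one tag per metadata attribute ('shred by metadata')."""
    tags, key = [], secret
    for attribute, value in record.items():
        tags.append({"link": link_for(key), "attribute": attribute, "value": value})
        key = evolve(key)
    return tags


def reaggregate(soup: list, secret: bytes) -> dict:
    """Walk the key chain and pull matching tags out of the mixed soup."""
    index = {tag["link"]: tag for tag in soup}  # a real system would query an archive
    record, key = {}, secret
    while (tag := index.get(link_for(key))) is not None:
        record[tag["attribute"]] = tag["value"]
        key = evolve(key)
    return record


# Two made-up patients whose tags are mixed into one soup.
patients = {
    b"secret-for-patient-1": {"allergy": "penicillin", "blood_type": "O+", "medication": "metformin"},
    b"secret-for-patient-2": {"allergy": "none", "blood_type": "A-"},
}
soup = [tag for secret, rec in patients.items() for tag in disaggregate(rec, secret)]
random.shuffle(soup)

recovered = reaggregate(soup, b"secret-for-patient-1")
print(recovered)
```

Because each tag keeps its attribute name, a holder of only part of the key chain could in principle be given only some attributes, which is the need-to-know motivation for shredding by metadata; the in-memory index used here hides the search cost that dominates the benchmarks in Table 1.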
At this point, the question might well be raised as to what, if any, of all the above is what PCAST actually wanted. We believe that Q-UEL is compliant with what PCAST did want, but it is indeed a matter of interpretation. Our interpretations do not seem far removed from those of other observers on various blogging sites, but there has been little in-depth formal discussion outside of Q-UEL itself. Even the Yosemite Manifesto [19] was rather vague, and although very general “roadmaps” for implementing its proposal have appeared, the status of any subsequent progress regarding them is rather unclear to us at the time of writing [66]. Indeed, the work done and still to be done arises because PCAST did not define a UEL in detail, essentially saying that “there are ways of doing this” [12].

References

1. B. Robson, T. P. Caruso and U. G. J. Balis, Suggestions for a Web Based Universal Exchange and Inference Language for Medicine, Computers in Biology and Medicine, 43(12) 2297 (2013).
2. I. M. Mullins, M. S. Siadaty, J. Lyman, K. Scully, G. T. Garrett, G. Miller, R. Muller, B. Robson, C. Apte, S. Weiss, I. Rigoutsos, D. Platt, and S. Cohen, Data mining and clinical data repositories: Insights from a 667,000 patient data set, Computers in Biology and Medicine, 36(12) 1351 (2006).
3. http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/ACO/
4. N. Li, A. F. Laine, H. Jianying, W. Fei, S. Jimeng and S. Ebadollahi, Mining Electronic Medical Records to Explore the Linkage between Healthcare Resource Utilization and Disease Severity in Diabetic Patients, Healthcare Informatics, Imaging and Systems Biology (HISB), IEEE International Conference, 250 (2011).
5. R. A. Greenes (Ed.), Clinical Decision Support, Academic Press (2006).
6. http://www.epic.com/software-intelligence.php (last accessed 2/10/2014).
7. J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Francisco CA: Morgan Kaufmann (1985).
8. B. Robson, Hyperbolic Dirac Nets for Medical Decision Support. Theory, Methods, and Comparison with Bayes Nets, Computers in Biology and Medicine, 51, 183 (2013).
9. P. A. M. Dirac, The Principles of Quantum Mechanics, Oxford University Press, Oxford (1930).
10. R. Penrose, The Road to Reality. A Complete Guide to the Laws of the Universe, Jonathan Cape, Random House, London (2004).
11. http://www.healthit.gov/policy-researchers-implementers/meaningful-use-regulations
12. http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-health-itreport.pdf
13. http://en.wikipedia.org/wiki/Semantic_Web (last accessed 3/30/2013).
14. http://en.wikipedia.org/wiki/Resource_Description_Framework (last accessed 4/10/2013).
15. http://en.wikipedia.org/wiki/Triplestore (last accessed 6/5/2013).
16. B. Buchanan, E. H. Shortliffe, Rule Based Expert Systems. The Mycin Experiments of the Stanford Heuristic Programming Project, Addison-Wesley: Reading, Massachusetts (1982).
17. A. Sninsky, Developing Universal Electronic Medical Records, Gastroenterol. Hepatol. (N Y) 4(3) 193 (2008), http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3088297/ (last accessed 5/11/2014).
18. N. D. Goodman and D. Lassiter, Probabilistic Semantics and Pragmatics, The Handbook of Contemporary Semantic Theory, Second Edition, Eds. S. Lapin, C. Fox, Chapter 21, Wiley (in production) (2015).
19. http://yosemitemanifesto.org/ (last accessed 7/5/2014).
20. B. Robson, The New Physician as Unwitting Quantum Mechanic: Is Adapting Dirac's Inference System Best Practice for Personalized Medicine, Genomics and Proteomics?, J. Proteome Res. (Am. Chem. Soc.) 6(8):3114 (2007).
21. http://www.healthbanking.org/index2.html (last accessed 7/25/2014).
22. S. S. Siparasa, JavaScript and JSON Essentials, Packt Publishing (2013).
23. http://json-ld.org/ (last accessed 9/17/2014).
24. http://www.siframework.org/ (last accessed 3/29/2014).
25. http://wiki.siframework.org/Interoperability+of+EHR+Work+Group (last accessed 5/16/2014).
26. http://www.qualityforum.org/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=68545 (last accessed 3/29/2014).
27. http://semanticweb.org/wiki/Bayes_OWL (last accessed 7/3/2013).
28. http://www.pr-owl.org/basics/bn.php (last accessed 1/25/2014).
29. H. Nottelmann, N. Fuhr, pDAML+OIL: A probabilistic extension to DAML+OIL, http://duepublico.uni-duisburg-essen.de/servlets/DerivateServlet/Derivate5571/Nottelmann_Fuhr_04a.pdf (last accessed 7/28/2013).
30. R. D. Appel, H. J. Komorowski, C. E. Barr, R. A. Greenes, Intelligent Focusing in Knowledge Indexing and Retrieval: The Relatedness Tool, Proc. of the Ann. Symp. on Computer Application in Medical Care 152-157 (1988).
31. S. Meystre, P. J. Haug, Medical Problem and Document Model for Natural Language Understanding, AMIA Annual Symposium Proceedings 2003:455-459 (2003).
32. B. Robson, R. Mushlin, Genomic Messaging System for Information-Based Personalized Medicine with Clinical and Proteome Research Applications, J. Proteome Res. (Am. Chem. Soc.) 3(5):930-948 (2004).
33. B. Robson and R. Mushlin, The Genomic Messaging System Language Including Command Extensions for Clinical Data Categories, J. Proteome Res. (Am. Chem. Soc.) 4(2):275-299 (2005).
34. Y. Park, R. Yu, L. Rang, W. Hye Won, J. H. Kim, Integrating Microarray Gene Expression Object Model and Clinical Document Architecture for Cancer Genomics Research, AMIA Annual Symposium Proceedings 2005:1073 (2005).
35. Y. R. Park, R. Yu Rang, H. W. Lee, J. H. Kim, H. Ju, Integrating Microarray Gene Expression Object Model and Clinical Document Architecture for Cancer Genomics Research, AMIA Annual Symposium Proceedings 2005:1074 (2005).
36. M. Popescu, G. Arthur, OntoQuest: A Physician Decision Support System based on Ontological Queries of the Hospital Database, AMIA Annual Symposium Proceedings 2006:639-643 (2006).
37. L. Robu, V. Robu, B. Thirion, An introduction to the Semantic Web for health sciences librarians, J. Medical Library Association, 94(2):198-205 (2006).
38. Q. Xu, Y. Shi, Q. Lu, G. Zhang, Q. Luo, Y. Li, GORouter: an RDF model for providing semantic query and inference services for Gene Ontology and its associations, BMC Bioinformatics 9(Suppl 1):S6 (2008).
39. C. Tao, W-Q. Wei, H. R. Solbrig, G. Savova, C. G. Chute, CNTRO: A Semantic Web Ontology for Temporal Relation Inferencing in Clinical Narratives, AMIA Annual Symposium Proceedings 2010:787-791 (2010).
40. B. Chisham, B. Wright, T. Le, T. C. Son, E. Pontelli, CDAOStore: Ontology-driven Data Integration for Phylogenetic Analysis, BMC Bioinformatics 12:98 (2011).
41. S. Liu, B. Zhou, G. Xie, J. Mei, H. Liu, C. Changsheng, L. Qi, Beyond Regional Health Information Exchange in China: A Practical and Industrial-Strength Approach, AMIA Annual Symposium Proceedings 2011:824-833 (2011).
42. S. Heymans, M. McKennirey, J. Phillips, Semantic validation of the use of SNOMED CT in HL7 clinical documents, J. of Biomedical Semantics 2:2 (2011).
43. C. Tao, H. R. Solbrig, C. G. Chute, CNTRO 2.0: A Harmonized Semantic Web Ontology for Temporal Relation Inferencing in Clinical Narratives, AMIA Summits on Translational Science Proceedings 2011:64-68 (2011).
44. G. Jiang, H. R. Solbrig, C. G. Chute, ADEpedia: A Scalable and Standardized Knowledge Base of Adverse Drug Events Using Semantic Web Technology, AMIA Annual Symposium Proceedings 2011:607-616 (2011).
45. A. Callahan, M. Dumontier, N. M. Shah, HyQue: evaluating hypotheses using Semantic Web technologies, J. Biomedical Semantics 2(Suppl 2):S3 (2011).
46. V. Mironov, N. Seethappan, W. Blondé, E. Antezana, A. Splendiani, M. Kuiper, Gauging triple stores with actual biological data, BMC Bioinformatics 13(Suppl 1):S3 (2012).
47. M-F. Sy, S. Ranwez, J. Montmain, A. Regnault, M. Crampes, V. Ranwez, User centered and ontology based information retrieval system for life sciences, BMC Bioinformatics 13(Suppl 1):S4 (2012).
48. I. Sim, S. Carini, S. W. Tu, L. Detwiler, J. Brinkley, S. A. Mollah, K. Burke, H. P. Lehmann, S. Chakraborty, K. M. Wittkowski, B. H. Pollock, T. M. Johnson, V. Huser, Ontology-Based Federated Data Access to Human Studies Information, AMIA Annual Symposium Proceedings 2012:856-865 (2012).
49. B. Chen, Y. Ding, D. J. Wild, Improving integrative searching of systems chemical biology data using semantic annotation, J. of Cheminformatics 4:6 (2012).
50. J. F. Brinkley, L. T. Detwiler, A Query Integrator and Manager for the Query Web, J. Biomedical Informatics 45(5):975-991 (2012).
51. M. E. Holford, J. P. McCusker, K-H. Cheung, M. Krauthammer, A semantic web framework to integrate cancer omics data with biological knowledge, BMC Bioinformatics 13(Suppl 1):S10 (2012).
52. C. Garcia, L. Jael, C. McLaughlin, A. Garcia, Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data, J. Biomedical Semantics 4(Suppl 1):S5 (2013).
53. J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, V. Gadepally, M. Hubbell, P. Michaleas, J. Mullen, A. Prout, A. Reuther, A. Rosa, C. Yee, Achieving 100,000,000 database inserts per second using Accumulo and D4M, IEEE High Performance Extreme Computing (HPEC), in press (2014).
54. R. H. Dolin, L. Alschuler, S. Boyer, C. Beebe, An update on HL7's XML-based document representation standards, Proceedings of the AMIA Symposium 2000:190-194 (2000).
55. R. H. Dolin, L. Alschuler, C. Beebe, P. V. Biron, S. L. Boyer, D. Essin, E. Kimber, T. Lincoln, J. E. Mattison, The HL7 Clinical Document Architecture, J. Am. Med. Informatics Association 8(6):552-569 (2001).
56. https://www.progress.com/products/data-integration-suite/data-integration-suitedeveloper-center/data-integration-suite-tutorials/healthcare-applications/convertingfrom-hl7-2x-to-hl7-3x (last accessed 10/22/2014).
57. R. H. Dolin, L. Alschuler, S. Boyer, C. Beebe, F. M. Behlen, P. V. Biron, A. M., HL7 Clinical Document Architecture, Release 2, J. Am. Med. Informatics Association 13(1):30-39 (2006).
58. https://www.hl7.org/documentcenter/public_temp_425ACAEF-1C23-BA170C6AF8FA2C5E1E32/wg/inm/Acf302.pdf (last accessed 10/22/2014).
59. http://www.hl7.org/implement/standards/product_brief.cfm?product_id=6 (last accessed 10/22/2014).
60. http://www.epsos.eu/ (last accessed 2/10/2014).
61. http://en.wikipedia.org/wiki/VistA (last accessed 2/10/2014).
62. K. C. O'Kane, The Mumps Programming Language, CreateSpace Independent Publishing Platform (2008).
63. V. Dinu and P. Nadkarni, Guidelines for the Effective Use of Entity-Attribute-Value Modeling for Biomedical Databases, Int. J. Med. Inform. 76(11-12):769-779 (2007).
64. C. Lovis, A. Lamb, R. Baud, A. M. Rassinoux, P. Fabry, A. Geissbühler, Clinical Documents: Attribute-Values Entity Representation, Context, Page Layout And Communication, AMIA Annual Symposium Proceedings 2003:396-400 (2003).
65. B. Robson, R. Dettinger, A. Peters, and S. K. P. Boyer, Drug discovery using very large numbers of patents: general strategy with extensive use of match and edit operations, J. Computer Aided Molecular Design 25(5):427-41.
66. http://www.dataversity.net/semantic-interoperability-future-healthcare-data/ (last accessed 10/22/2014).
