License: CC BY 4.0
arXiv:2401.00942v1 [cs.CE] 01 Jan 2024

The Influence of Biomedical Research on Future Business Funding: Analyzing Scientific Impact and Content in Industrial Investments

Reza Khanmohammadi, Michigan State University, Computer Science and Engineering, East Lansing, USA (khanreza@msu.edu)
Simerjot Kaur, JPMorgan Chase, Artificial Intelligence Research, New York, USA
Charese H. Smiley, JPMorgan Chase, Artificial Intelligence Research, New York, USA
Tuka Alhanai, New York University Abu Dhabi, Computer Engineering, Abu Dhabi, UAE
Ivan Brugere, JPMorgan Chase, Artificial Intelligence Research, New York, USA
Armineh Nourbakhsh, JPMorgan Chase, Artificial Intelligence Research, New York, USA
Mohammad M. Ghassemi, Michigan State University, Computer Science and Engineering, East Lansing, USA

Abstract

This paper investigates the relationship between scientific innovation in biomedical sciences and its impact on industrial activities, focusing on how the historical impact and content of scientific papers influenced future funding and innovation grant application content for small businesses. The research incorporates bibliometric analyses along with SBIR (Small Business Innovation Research) data to yield a holistic view of the science-industry interface. By evaluating the influence of scientific innovation on industry across 10,873 biomedical topics and taking into account their taxonomic relationships, we present an in-depth exploration of science-industry interactions where we quantify the temporal effects and impact latency of scientific advancements on industrial activities, spanning from 2010 to 2021. Our findings indicate that scientific progress substantially influenced industrial innovation funding and the direction of industrial innovation activities. Approximately 76% and 73% of topics showed a correlation and Granger-causality between scientific interest in papers and future funding allocations to relevant small businesses. Moreover, around 74% of topics demonstrated an association between the semantic content of scientific abstracts and future grant applications. Overall, the work contributes to a more nuanced and comprehensive understanding of the science-industry interface, opening avenues for more strategic resource allocation and policy developments aimed at fostering innovation.

Figure 1: The bidirectional flow of breakthroughs between academia and the industry. Above the timeline, blue boxes represent scientific discoveries that have later catalyzed commercial applications, as indicated by the arrows leading to red boxes. Below the timeline, red boxes mark industry innovations that have subsequently inspired scientific exploration and advancement, traced back to blue boxes. The interchanging paths underscore the dynamic exchange between scientific inquiry and industrial application across different fields.

Introduction

It is widely believed that scientific innovation influences present-day or future industrial activities; this influence is especially pronounced in the biomedical and health sciences, where scientific validation is often a prerequisite to commercial translation [1]. CRISPR-Cas9 [2] is one recent example, illustrated in Figure 1, that demonstrates how scientific innovation can influence the direction of industrial activity. Originating from basic biological research into the immune systems of bacteria, CRISPR-Cas9 has impacted the biomedical industry by introducing a more efficient and precise method for gene editing; this in turn has led to numerous downstream industrial applications, including more effective gene therapies [3]. However, the magnitude and time-scale of a given scientific innovation’s impact on downstream industrial activities can vary dramatically by research topic, and with a host of other complex forces including government regulation, economic conditions, and societal perceptions. For instance, research on the role of telomere decomposition in the aging process has provided important insights into how and why biological organisms age [4], but has not (yet) resulted in downstream industrial applications. To date, a detailed characterization of the magnitude and time-scale of scientific activities’ impact on downstream industrial activities is missing; addressing this gap is the primary objective of this research paper.

Understanding the relationship between scientific innovation and industrial activity has substantial practical implications. Identifying scientific areas with a significant impact on the industry will enable businesses to anticipate future trends and adapt investment strategies accordingly. Conversely, when researchers are familiar with the commercial implications of their research, they are more likely to conduct research that has a high chance of being translated into practical applications. For example, knowing the industrial relevance of developing more effective antibiotics can guide biomedical researchers to focus their efforts on this area, potentially resulting in improved public health outcomes and opportunities for economic growth in the pharmaceutical sector.

Research Questions

Our key objective is to clarify how present areas of scientific innovation were associated with future small business commercialization activities and funding within the same innovation areas. More specifically, this study aims to answer two research questions:

  1. For a given topic in the biomedical sciences, can the historical impact of scientific papers be a leading indicator of future funding allocations to small businesses active in those topics?

  2. For a given topic in the biomedical sciences, can the historical content of scientific abstracts be a leading indicator of the future content of innovation grant applications for small businesses active in those topics?

If these research questions can be answered in the affirmative, it implies several opportunities for additional research and development. For example, if historical trends in scientific activities are associated with future funding allocations to small businesses working on associated topics, then it may be possible to forecast future (unknown) industrial trends based on current (known) scientific activities. Insofar as this forecasting can be performed with fidelity, this understanding can guide resource allocation, investment strategies, and inform policy developments intended to foster innovation.

Related Work

The relationship between scientific innovation and industry is crucial for technological and economic progress. The U.S.’s proposed $191 billion funding for research and development in 2023 [5] shows the importance of science in driving industrial growth. To understand this complex relationship, researchers often use bibliometric and patent analyses, which provide objective views on the ongoing interaction between science and industry.

Bibliometric Analysis

Bibliometric analysis is a method often used to delve into scientific literature and detect evolving trends, providing a way to illustrate the connections between science and industry [6]. This approach has been utilized in various studies to investigate aspects such as university-industry collaborations [7]. It has also been used to examine the scientific output in the publishing industry, identifying leading academic publications, top authors, and primary research countries [8]. Notably, these studies highlighted the US, UK, Spain, and China as leading nations in scientific output. Further research identified central themes like cyber-physical systems and cloud computing in the Industry 4.0 research field [9]. Thus, bibliometric analysis facilitates a deeper understanding of scientific trends, their potential applications, and how they shape and are shaped by industrial development and socio-economic influences.

Patent Analysis

While bibliometric analysis provides insights into scientific trends and their influence on industry, it alone cannot fully capture the complexity of industrial activities and their evolution. To complement this approach, patent analysis offers a focused examination of industry-related intellectual property and technological advancements [10]. For example, a study on Carbon Capture, Utilization, and Storage (CCUS) technology revealed its rapid development since 2013, with patents concentrated in China, the US, and Japan, particularly in the energy and electricity sectors [11]. Moreover, patent analysis enhances our understanding of industry-focused aspects. For instance, a scientometric study on Smart Cities showed that research predominantly focuses on social aspects, while related technologies emphasize specific technical solutions, often overlooking the role of citizens [12]. Wang and Li [13] also highlighted the significant impact of high-quality academic research on patent development in the realm of nanotechnology, using data from nano patents and their associated citations. They also shed light on the variety in citation patterns, influenced by factors like organizational type and origin of knowledge, and suggest that a broader scientific scope does not necessarily translate into higher patent quality. By incorporating patent analysis alongside bibliometric analysis, researchers can gain a more comprehensive view of the intricate relationship between scientific progress and industrial evolution [14].

Empirical Insights from PubMed papers and SBIR awards

PubMed, a renowned platform for biomedical research, has been at the forefront of capturing evolving scientific trends. Recent studies on this platform cover a wide array of specific biomedical topics such as Kawasaki disease [15], COVID-19 [16], protein engineering [17], and spine surgery [18], as well as broader public health topics such as water quality [19] and vaccine clinical trials [20]. These in-depth analyses offer a window into the current state and gaps in medical research, highlighting areas of priority. The SBIR program, on the other hand, offers specific insights into technological advancements and commercialization strategies within the industry sector. The exploration by Audretsch et al. [21] of SBIR-awarded projects between 1992 and 2001 underscores the value of intertwining university expertise with industry-driven goals. In addition to enhancing the quantity of scientific papers, this intersection enhances the richness of technological advances. Accordingly, the analysis by Hayter and Link [22] of 1,180 SBIR-endorsed firms emphasizes the complementary role of patenting and publishing as intertwined components of technology commercialization. Despite their valuable individual contributions, PubMed studies and SBIR projects offer largely untouched opportunities to explore the science-industry relationship more comprehensively.

Gaps in the Literature

Despite the insightful findings of these studies, several gaps persist in the literature:

  • The number and diversity of topics addressed in these studies have been rather limited.

  • While patents provide a glimpse into technological advancements, they do not fully capture the nuances of businesses that operate on these technologies, nor do they directly reflect financial implications.

  • The semantic content of scientific and industrial activities has not been adequately accounted for in previous research.

  • The temporal aspect of the evolution, which could be viewed as a time series analysis, remains largely unexplored, as does the time it takes for the impact of science to materialize in industry.

Contributions of this work

Our contributions in this work provide a more comprehensive understanding of the science-industry interface, helping to bridge existing gaps in the literature. We provide:

  • Broad assessment of biomedical research topics: We explore the effect of scientific innovation on industrial activities within 10,873 biomedical topics; this is the most comprehensive exploration of its kind, to the best of our knowledge.

  • Accounting for topical taxonomic relationships: The topics investigated are hierarchically structured; thus, we account for the taxonomic relationship and rank of the topics. By accounting for these taxonomic properties, our analysis also clarifies if scientific innovation influences very specific industrial activities (e.g. CRISPR-Cas9), or broader industrial trends (e.g. Genetic Phenomenon).

  • Assessment of temporal effects and impact latency: We examine the temporal evolution of scientific innovation’s influence on industry, analyzing the latency of its impact from 2010 to 2021.

  • Diverse analytical approaches: We employ various methods to understand the science-industry connection, including correlation and causality studies, as well as analyzing content overlaps between scientific and industrial texts.

We hope that our study bridges the gap that exists in the literature between analyzing scientific research and probing the trends in industry investments.

Methods

In the Methods section, we outline the data sources and analytical approaches employed to address the research questions outlined earlier. We detail the bibliometric and industrial innovation datasets used, describe the taxonomy for categorizing biomedical topics, and present our methodologies for assessing the relationships between scientific interests and industry funding, as well as the semantic content of academic and industrial texts.

Data

Data Sources

All data for this study were publicly available; they were sourced from: (1) the abstracts and metadata of 10,928,078 scientific publications in PubMed [23] and, (2) the project abstracts and award amounts of 63,488 Small Business Innovation Research (SBIR) grants [24]. Both assets were relevant to the time period spanning 2010 to 2021.

Selected Topic Taxonomy

To analyze the impact of scientific activity on industrial innovation across a consistent set of topics, we utilized the Medical Subject Headings (MeSH) taxonomy (https://www.ncbi.nlm.nih.gov/mesh/). MeSH is a controlled and hierarchically organized vocabulary produced by the National Library of Medicine that facilitates indexing, cataloging, and searching for biomedical and health-related information. MeSH terms are organized in a tree-like structure, with more general terms at higher levels and more specific terms at lower levels.

The MeSH ontology consists of more than 29k terms, a large proportion of which did not occur in any SBIR abstract from 2010 to 2021. This absence can be largely attributed to the high specificity of many MeSH terms, which do not align with the more general, high-level language and vocabulary typically employed in SBIR abstracts. For instance, "Vibrio vulnificus" is a MeSH term that represents a specific species of bacteria. Given its high degree of specificity, it is unlikely to appear in SBIR abstracts. However, its higher-level parent term in the MeSH hierarchy, "Bacteria", is much more general, and therefore more likely to appear in SBIR abstracts. To create a more manageable and relevant dataset for our investigation, we reduced the number of MeSH terms to 10,873. This revised subset only included topics that were present at least once in the SBIR abstracts between 2010 and 2021, providing a more effective scale for exploring the science-industry relationship.

Topic Annotation Approach

The papers collected from PubMed include human-generated MeSH annotations; however, the SBIR awards were not annotated for their topical contents (MeSH or otherwise). We generated the missing MeSH annotations for the SBIR awards by applying ScispaCy [25] to the SBIR abstracts, extracting Unified Medical Language System (UMLS) [26] labels with a confidence score exceeding 90%, and converting the UMLS topics to their corresponding entries within MeSH using the UMLS API (https://documentation.uts.nlm.nih.gov/rest/home.html). An example scientific paper and industrial project are shown in Figure 2, along with their respective shared MeSH terms.
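The filtering step in this annotation pipeline can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an entity linker has already produced `(concept_id, score)` candidates for one SBIR abstract, and that a concept-to-MeSH lookup table is available. The function name, the `CUI_*` identifiers, and the lookup table are hypothetical placeholders.

```python
def mesh_terms_for_abstract(candidates, concept_to_mesh, threshold=0.90):
    """Keep linked concepts whose confidence exceeds the threshold and map them to MeSH.

    candidates:      iterable of (concept_id, score) pairs for one abstract (assumed input).
    concept_to_mesh: dict from concept identifier to MeSH term (assumed lookup table).
    """
    terms = set()
    for concept_id, score in candidates:
        # "Exceeding 90%" is read here as a strict inequality.
        if score > threshold and concept_id in concept_to_mesh:
            terms.add(concept_to_mesh[concept_id])
    return terms


# Placeholder identifiers, not real UMLS concept IDs.
candidates = [("CUI_A", 0.97), ("CUI_B", 0.62), ("CUI_C", 0.93)]
concept_to_mesh = {"CUI_A": "Bacteria", "CUI_B": "Neoplasms"}
annotations = mesh_terms_for_abstract(candidates, concept_to_mesh)
```

Concepts without a MeSH counterpart (here `CUI_C`) are dropped silently in this sketch; whether the original pipeline handled them differently is not stated.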

Methods for Research Question 1

For a given topic in the biomedical sciences, the goal of our first research question is to understand if the historical incidence or impact of scientific papers (measured by citations) can be a leading indicator of future funding allocations to small businesses working on the same topics. To explore the relationship between science and industry for our first research question, we develop metrics that capture the incidence and impact of scientific publications as well as the investment trends in SBIR grants. We apply these metrics to both scientific publications and industrial data within each MeSH topic and track them over time. This allows us to examine the association between scientific advancements and subsequent funding allocations to small businesses within the same topics.

For each topic, we represent the scientific and industrial data in two ways: (1) as the normalized annual frequencies of published PubMed papers and awarded SBIR grants related to that topic and, (2) as the normalized citation counts of published PubMed papers and cumulative funding awarded through SBIR grants related to that topic. Formal details of the approach are given below.

Figure 2: Illustration of shared MeSH term labels between a PubMed paper [27] and an SBIR project description (https://www.sbir.gov/node/1327383).

Representing Data as Paper and SBIR Award Frequency

We represent scientific activity ($f^{P}$) as a min-max normalized timeseries of the total annual number of papers published on a given topic ($m$) or its children topics ($m_{c}$) in MeSH. We represent industrial activity ($f^{S}$) as a min-max normalized timeseries of the total annual number of SBIR grants awarded to small businesses working on those topics, or their children topics in MeSH. In Equation 1, we formally denote how the $f^{P}$ and $f^{S}$ signals are generated from the set of all PubMed abstracts $P(t)$ and SBIR abstracts $S(t)$:

$$f^{X}(t,m)=\frac{1}{N^{x}(t)}\sum_{j\in X(t)}H^{j}(m) \tag{1}$$

Where: $X\in\{P,S\}$; $t\in\mathbb{Z}$, $2010\leq t\leq 2021$; $N^{x}(t)$ represents the total number of (PubMed or SBIR) abstracts in year $t$ and

$$H^{j}(m)=\mathbf{Heaviside}\left(-1+\sum_{i\in\{m,m_{c}\}}\delta(j,i)\right) \tag{2}$$

Where: $\delta(j,i)$ is a function that returns 1 if abstract $j$ is on topic $i$ and 0 otherwise. $f^{P}$ and $f^{S}$ were further normalized within each topic (i.e. across time) using Min-Max scaling (i.e. rescaling the values of the signals into the range of 0 to 1). We denote $\tilde{f}^{P}$ and $\tilde{f}^{S}$ as the Min-Max normalized versions of $f^{P}$ and $f^{S}$, respectively:

$$\tilde{f}^{X}(t,m)=\frac{f^{X}(t,m)-\min f^{X}(:,m)}{\max f^{X}(:,m)-\min f^{X}(:,m)} \tag{3}$$
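As a rough illustration of Equations 1 to 3, the sketch below computes the yearly share of abstracts tagged with a topic or any of its children, then rescales that series to the range 0 to 1. It is a sketch under stated assumptions and not the authors' code: each abstract is represented as a set of MeSH tags, the Heaviside step is taken to return 1 at zero (so a single matching tag counts as a hit), and the function names and toy data are hypothetical.

```python
def topic_frequency(abstracts_by_year, topic_and_children):
    """Equations 1-2: share of each year's abstracts on the topic or one of its children."""
    signal = {}
    for year, abstracts in abstracts_by_year.items():
        # An abstract counts once, however many of the topic's tags it carries.
        hits = sum(1 for tags in abstracts if tags & topic_and_children)
        signal[year] = hits / len(abstracts) if abstracts else 0.0
    return signal


def min_max(signal):
    """Equation 3: rescale a per-year signal into [0, 1] across time."""
    lo, hi = min(signal.values()), max(signal.values())
    if hi == lo:
        # A flat series has no spread; returning zeros is an assumption of this sketch.
        return {year: 0.0 for year in signal}
    return {year: (value - lo) / (hi - lo) for year, value in signal.items()}


# Toy data: each abstract is the set of MeSH tags attached to it.
abstracts_by_year = {
    2010: [{"Bacteria"}, {"Neoplasms"}],
    2011: [{"Vibrio vulnificus"}, {"Neoplasms"}, {"Neoplasms"}, {"Aging"}],
    2012: [{"Bacteria"}],
}
bacteria = {"Bacteria", "Vibrio vulnificus"}  # topic m plus its child m_c

raw = topic_frequency(abstracts_by_year, bacteria)
scaled = min_max(raw)
```

The same two functions apply unchanged to SBIR abstracts, which gives the industrial counterpart of the signal.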

Representing Data as Paper and Funding Impact

We explored an alternative representation of the scientific activity ($g^{P}$) as a quartile-quantized timeseries of the total citation count of all papers on the topic or its children topics in MeSH. We explored an alternative representation of the industrial activity ($g^{S}$) as a quartile-quantized timeseries of the total funding amount (in dollars) allocated to small businesses working on those topics, or their children topics in MeSH. In Equation 4 below, we formally denote how $g^{P}$ and $g^{S}$ are generated from the PubMed and SBIR data:

$$g^{X}(t,m)=\frac{1}{K^{X}(t)}\sum_{j\in X(t)}\mathcal{Q}_{m}\left[c(j)\,H^{j}(m)\right] \tag{4}$$
Figure 3: This figure illustrates the calculation of the Cross-Correlation Area Under the Curve (CCAUC) ratio, a measure of the lead-lag relationship between scientific advancements and industrial activities. The line graphs on the left side, with science represented by blue lines and industry by red lines, track the sum of impact scores over time (note that the line graphs are schematic representations and do not depict actual trends; they are included solely for illustration purposes to demonstrate the conceptual framework). In the top cross-correlation plot, a rightward skew with a larger blue area indicates scenarios where scientific trends precede industrial ones, yielding a greater CCAUC ratio. The bottom plot shows the inverse, with a leftward skew and a more pronounced red area where industry leads science. The CCAUC ratio itself is derived by dividing the positive-lagged area (science leading) by the negative-lagged area (industry leading) under the cross-correlation curve. This distinction between line graphs and cross-correlation plots highlights not only the direction but also the temporal lag and the quantitative extent of impact between scientific research and industrial application.

Where $X\in\{P,S\}$; $c(j)$ is a function that returns the total number of citations if $j$ is a scientific abstract, and the award amount if it is a project description. Given $\mathcal{A}$ as $m$’s parent at the highest level of the hierarchy, $\mathcal{Q}_{m}$ is a function that converts the raw citation count to its corresponding quartile position when considering the distribution of citations within a given time-step, across topics $\{\mathcal{A},\mathcal{A}_{c}\}$. Lastly,

$$K^{X}(t)=\sum_{j\in X(t)}\mathcal{Q}_{m}\left[c(j)\right] \tag{5}$$

is a citation normalization constant where $m$ is the topic of abstract $j$. $g^{P}$ and $g^{S}$ were further normalized within each topic (i.e. across time) as $\tilde{g}^{P}$ and $\tilde{g}^{S}$ using Min-Max scaling following the same approach described in Equation 3 above.
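One possible reading of Equations 4 and 5 for a single year is sketched below. Several details are assumptions rather than facts from the text: quartile positions are coded 1 to 4, the reference distribution is simply the list of all scores passed in (standing in for the abstracts under the same top-level branch), and off-topic abstracts contribute nothing to the numerator. The function names and toy data are hypothetical.

```python
from bisect import bisect_right


def quartile_position(value, reference):
    """Return 1-4: the quartile that `value` falls into within `reference`."""
    ordered = sorted(reference)
    rank = bisect_right(ordered, value)  # how many reference values are <= value
    quartile = (4 * rank + len(ordered) - 1) // len(ordered)  # integer ceil(4 * rank / n)
    return min(4, max(1, quartile))


def impact_share(abstracts, topic_and_children):
    """Quartile-weighted share of one year's impact that falls on a topic.

    abstracts: list of (tags, score) pairs, where score is a citation count
               (papers) or an award amount (SBIR projects).
    """
    reference = [score for _, score in abstracts]
    total = 0      # plays the role of K^X(t) in Equation 5
    on_topic = 0   # numerator of Equation 4
    for tags, score in abstracts:
        q = quartile_position(score, reference)
        total += q
        if tags & topic_and_children:
            on_topic += q
    return on_topic / total if total else 0.0


# Toy year with four abstracts across two topics.
year_abstracts = [({"A"}, 0), ({"A"}, 10), ({"B"}, 5), ({"B"}, 20)]
share_a = impact_share(year_abstracts, {"A"})
share_b = impact_share(year_abstracts, {"B"})
```

Quantizing to quartiles before summing limits the influence of a few extremely highly cited papers or very large awards, which is presumably the motivation for this representation.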

Measurement of trend association

Cross-correlation (CC) is a measure of similarity between two signals as a function of a time-lag applied to one of them. In the context of this paper, cross-correlation was applied to each pair of scientific and industrial representations defined in Sections "Representing Data as Paper and SBIR Award Frequency" and "Representing Data as Paper and Funding Impact". More specifically, for all topics, we computed the cross correlations $\tilde{f}^{P}(t,m)*\tilde{f}^{S}(t-\tau,m)$, and $\tilde{g}^{P}(t,m)*\tilde{g}^{S}(t-\tau,m)$ where $\tau$ represents the number of years the industrial signal was shifted and varied from -11 to 11 (inclusive).

Lags in CC analysis are crucial for understanding temporal relationships between trends. For instance, if the industrial frequency trend ($\tilde{f}^{S}$) for a specific MeSH term peaks a few years after the same trend in the scientific domain ($\tilde{f}^{P}$), this suggests a pattern where industrial trends are informed by prior scientific work on that term.

For each topic, we computed a single measure that denoted if the scientific activity was more likely to be leading industrial activity than vice versa. This measure was the ratio of the cumulative cross correlation for all positive $\tau$ to the cumulative cross correlation for all negative $\tau$. In Equations 6 and 7, we formally denote how this Cross-Correlation Area Under the Curve (CCAUC) ratio was computed for the signal representation pairs defined in Sections "Representing Data as Paper and SBIR Award Frequency" and "Representing Data as Paper and Funding Impact" respectively:

$$\text{CCAUC}_{f}(m)=\frac{1+\sum_{\tau=0}^{11}\tilde{f}^{P}(t,m)*\tilde{f}^{S}(t-\tau,m)}{1+\sum_{\tau=-11}^{0}\tilde{f}^{P}(t,m)*\tilde{f}^{S}(t-\tau,m)} \tag{6}$$

$$\text{CCAUC}_{g}(m)=\frac{1+\sum_{\tau=0}^{11}\tilde{g}^{P}(t,m)*\tilde{g}^{S}(t-\tau,m)}{1+\sum_{\tau=-11}^{0}\tilde{g}^{P}(t,m)*\tilde{g}^{S}(t-\tau,m)} \tag{7}$$

A CCAUC ratio exceeding 1 implies that the industrial trend was more likely to have lagged the scientific trend than vice versa. Conversely, a ratio below 1 suggests the reverse: an industrial trend preceding its scientific counterpart. A ratio equal to 1 implies no time-lagged relationship between the trends. Thus, the CCAUC ratio provides us with a single measure to study if scientific activity was more likely to be leading industrial activity. In Figure 3, we provide an illustrative depiction of the CCAUC Ratio. In essence, the CCAUC ratio indicates the temporal delay and the directional correlation between scientific discoveries and industrial applications, highlighting the sequence and magnitude of their correlation. It offers a nuanced view of the chronological interconnection that shapes the trajectory of advancements across these domains.
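The CCAUC ratio can be sketched in a few lines. The sketch below assumes two equally long, already normalized annual series and adopts one specific sign convention, which is an assumption: a positive lag pairs science in year t with industry in year t + lag, so that positive lags correspond to science leading, as in the Figure 3 description. Following Equations 6 and 7 as written, the zero lag is included in both sums. The function names and toy series are hypothetical.

```python
def cross_correlation(science, industry, lag):
    """Sum of science[t] * industry[t + lag] over the years where both exist."""
    n = len(science)
    total = 0.0
    for t in range(n):
        u = t + lag
        if 0 <= u < n:
            total += science[t] * industry[u]
    return total


def ccauc_ratio(science, industry):
    """Ratio of cumulative cross-correlation at lags >= 0 to that at lags <= 0."""
    max_lag = len(science) - 1  # 11 for twelve annual points (2010 to 2021)
    leading = sum(cross_correlation(science, industry, k) for k in range(0, max_lag + 1))
    lagging = sum(cross_correlation(science, industry, k) for k in range(-max_lag, 1))
    # The +1 terms keep the ratio defined when one side sums to zero.
    return (1.0 + leading) / (1.0 + lagging)


# Toy twelve-year series: science peaks in the third year, industry three years later.
science = [0.0, 0.0, 1.0] + [0.0] * 9
industry = [0.0] * 5 + [1.0] + [0.0] * 6

ratio_science_first = ccauc_ratio(science, industry)
ratio_industry_first = ccauc_ratio(industry, science)
ratio_identical = ccauc_ratio(science, science)
```

With these toy inputs the ratio is above 1 when the science peak comes first, below 1 when the roles are swapped, and exactly 1 for identical series.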

Assessment of topic hierarchy on trend association

Our study acknowledges the complexity inherent in translating scientific discoveries into industrial applications, which varies not only in pace but also in the level of detail. To accurately reflect this diversity, we adopt a hierarchical analysis approach that examines the interplay between scientific and industrial trends across different levels of the MeSH taxonomy. By progressively navigating from broader categories to more specialized ones, we aim to illuminate the varying degrees of influence that scientific research exerts on industrial activity, from general trends to niche advancements. Given that the MeSH tree is composed of 13 layers, our traversal process involved a step-wise descent into each successive breadth level. Within each level, we incorporated all topics that are nested from the tree’s root to our current depth. For every topic, denoted as $m$, we calculated CCAUC$(m)$ and determined the proportion of these values that exceeded 1, as well as those equal to or less than 1. Furthermore, we computed the Maximum Cross-Correlation (MCC) lag for each topic. This enabled us to determine the temporal change at which the correlation between scientific and industrial trends of $m$ reached its maximum prominence.
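The depth-wise aggregation and the MCC lag can be illustrated with the short sketch below. It assumes that a CCAUC value and a depth have already been computed for each topic, and that cross-correlation values are available per lag. The data structures, function names, and numbers are hypothetical toy values, not results from the study.

```python
def mcc_lag(cc_by_lag):
    """Lag with the largest cross-correlation (first such lag in dict order on ties)."""
    return max(cc_by_lag, key=cc_by_lag.get)


def share_science_leading(topics, max_depth):
    """Proportion of topics from the root down to `max_depth` whose CCAUC exceeds 1.

    topics: list of (depth, ccauc) pairs, with depth 1 for the top MeSH layer.
    """
    selected = [ccauc for depth, ccauc in topics if depth <= max_depth]
    if not selected:
        return None
    return sum(1 for ccauc in selected if ccauc > 1) / len(selected)


# Toy inputs for illustration only.
cc_by_lag = {-2: 0.1, -1: 0.3, 0: 0.5, 1: 0.9, 2: 0.4}
topics = [(1, 1.4), (1, 0.8), (2, 1.2), (3, 1.0)]

best_lag = mcc_lag(cc_by_lag)
shares = {depth: share_science_leading(topics, depth) for depth in (1, 2, 3)}
```

Because each level includes every topic from the root down to the current depth, the proportions are cumulative: deeper levels add topics to the pool instead of replacing it.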

To establish confidence intervals at each depth, we iteratively computed the CCAUC ratios for various subsets of scientific and industrial trends. Using a sliding window parameter, $\mathfrak{w}$, which ranges from 1 to 11 years, we strategically select subsets of time series, $f^{X}_{2021-\mathfrak{w}:2021}$ and $g^{X}_{2021-\mathfrak{w}:2021}$. This selection process enables us to observe the evolution of the CCAUC ratio distribution in relation to varying values of $\mathfrak{w}$. Consequently, the error bounds computed are representative of the standard deviation of the corresponding CCAUC ratios at each depth level.
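The windowing procedure can be sketched as follows. This is an illustration under stated assumptions: the window is taken to include both endpoints (years $2021-\mathfrak{w}$ through 2021), the spread is a population standard deviation of the share of ratios above one, and `toy_ratio` is a hypothetical placeholder that stands in for the CCAUC ratio rather than reproducing it.

```python
from statistics import pstdev


def tail_window(series, w):
    """Keep the last w + 1 yearly values, i.e. years 2021 - w through 2021."""
    return series[-(w + 1):]


def share_above_one(topics, ratio_fn, w):
    """Fraction of topics whose ratio exceeds 1 on the windowed series."""
    ratios = [ratio_fn(tail_window(sci, w), tail_window(ind, w)) for sci, ind in topics]
    return sum(r > 1 for r in ratios) / len(ratios)


def window_spread(topics, ratio_fn, windows=range(1, 12)):
    """Shares for w = 1..11 and their standard deviation (the error bound)."""
    shares = [share_above_one(topics, ratio_fn, w) for w in windows]
    return shares, pstdev(shares)


def toy_ratio(sci, ind):
    """Placeholder only (NOT the CCAUC ratio): smoothed ratio of total activity."""
    return (1 + sum(ind)) / (1 + sum(sci))


# Three hypothetical topics, each a (science, industry) pair of 12 yearly values.
topics = [
    ([1] * 12, [2] * 12),
    ([2] * 12, [1] * 12),
    ([0] * 6 + [5] * 6, [3] * 12),
]
shares, spread = window_spread(topics, toy_ratio)
```

Any ratio function with the same two-series signature could be passed as `ratio_fn`, so the windowing logic stays independent of how the ratio itself is defined.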

Measurement of trend causality

Granger Causality (GC) is a statistical approach that assesses whether changes in one time series can predict changes in another. It’s particularly useful in understanding potential relationships between two evolving trends across different lags. In this research, we applied GC to investigate the connections between scientific and industrial domains, as described in Sections "Representing Data as Paper and SBIR Award Frequency" and "Representing Data as Paper and Funding Impact". For each topic, we utilized the chi-squared GC test to determine (1) the extent to which current normalized annual frequencies of scientific papers ($\tilde{f}^{P}$) can be indicative of future normalized frequencies of SBIR grants ($\tilde{f}^{S}$) associated with the same topic, and (2) the degree to which the current normalized citation counts of scientific papers ($\tilde{g}^{P}$) serve as predictors for subsequent cumulative funding allocated through SBIR grants for that topic ($\tilde{g}^{S}$). The outcome of this test, represented as a p-value, elucidates the statistical significance of the causal relationships between the paired signals.
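To make the test concrete, the sketch below implements a single-lag, SSR-based chi-squared Granger test using only the standard library. It is a simplified illustration, not the study's implementation: it handles one lag only (so the statistic has one degree of freedom), fits the regressions through the normal equations, and runs on made-up series. All function names are hypothetical.

```python
from math import erfc, sqrt


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ssr(rows, y):
    """Sum of squared residuals of the least-squares fit of y on the given rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y))


def granger_chi2_lag1(cause, effect):
    """Chi-squared test (1 lag) of whether `cause` helps predict `effect`."""
    y = effect[1:]
    steps = range(len(effect) - 1)
    restricted = [[1.0, effect[t]] for t in steps]              # own past only
    unrestricted = [[1.0, effect[t], cause[t]] for t in steps]  # plus the other series
    ssr_r, ssr_u = ssr(restricted, y), ssr(unrestricted, y)
    stat = len(y) * (ssr_r - ssr_u) / ssr_u
    p_value = erfc(sqrt(max(stat, 0.0) / 2.0))  # chi-squared survival function, 1 d.o.f.
    return stat, p_value


# Assumed toy data: industry follows science with a one-year delay plus small noise.
science = [0.10, 0.35, 0.20, 0.50, 0.30, 0.65, 0.45, 0.80, 0.55, 0.90, 0.70, 1.00]
noise = [0.02, -0.01, 0.01, -0.02, 0.015, -0.015, 0.01, -0.01, 0.02, -0.02, 0.01, -0.01]
industry = [0.05] + [0.8 * science[t - 1] + noise[t] for t in range(1, 12)]
stat, p_value = granger_chi2_lag1(science, industry)
```

A small p-value here only indicates that the lagged scientific series improves the prediction of the industrial one beyond its own past. Extending the sketch to several lags would add one regressor per lag and require the chi-squared distribution with the matching degrees of freedom.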

Assessment of topic hierarchy on trend causality

Leveraging the hierarchical methodology from Section "Assessment of topic hierarchy on trend association", we traversed the MeSH tree to measure GC between scientific and industrial trends. For each topic, $m$, we determined the causality’s significance by calculating its p-value and gauged the proportion below 0.05. Employing time lags up to 11 years for each analysis, we traced the distribution of GC significance across MeSH depths. The computed error bounds reflect the yearly standard deviation of these significant p-value ratios for each layer.

Methods for Research Question 2

For a given topic in the biomedical sciences, the goal of our second research question is to understand how the historical content of scientific abstracts can be a leading indicator of the future content of innovation grant applications for small businesses working on those topics.

Representing the data as text embeddings

To answer our second research question, we investigated temporal associations between the semantic content of scientific papers and SBIR abstracts. More specifically, for all PubMed and SBIR abstracts in a given topic, we utilized the E5 (https://huggingface.co/intfloat/e5-large) embedding model [28], recognized for its state-of-the-art performance in text representation, to generate text embeddings, and studied their associations over time. Formally, given MeSH term $m$ and $(y_p, y_s)$ as a pair of years ranging from 2010 to 2021, we select all scientific and industrial papers labeled with $m$ in years $y_p$ and $y_s$, respectively. We then embed the corresponding scientific abstracts and industrial project descriptions of those papers into a 1024-dimensional space. Next, we transform these high-dimensional embeddings into a two-dimensional space using Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) [29], the latter noted for its efficiency with large datasets and preservation of global structure compared to t-SNE [30]. PCA and UMAP capture the linear and non-linear structures inherent in high-dimensional embeddings, respectively. We discretize this two-dimensional space, setting the number of bins along each dimension to $b_{1,2} = 20$, yielding a 20x20 matrix. Each cell of this matrix represents a unique region in the semantic context space.

Figure 4: For each year pair, scientific and industrial abstracts are spatially mapped and normalized for similarity calculations. Each pair of embeddings generates a unique similarity value for its respective position in the grid. The contextual similarity between current scientific abstracts and future industrial project descriptions populates the upper triangle, and vice versa for the lower triangle. The triangular ratio ($tr$), representing the influence of scientific context on future industrial projects, is the ratio of the cumulative sum of similarities in the upper triangle to that in the lower triangle.

For each bin in the grid, we sum the total quartile-quantized citations of all scientific abstracts and the total quartile-quantized award amounts of all industrial abstracts that fall into it using the approach from Section "Representing Data as Paper and Funding Impact". Given $D_{xy}$ as the set of embedding points discretized at bins $x$ and $y$ of the $b_1 \times b_2$ grid, we calculate the density of that bin for scientific and industrial abstracts using Equation 8:

$$\underline{\underline{M}}^{X}_{b_1 \times b_2}(t, m) = \sum_{j \in D_{xy}} \mathcal{Q}_m[c(j)] \tag{8}$$

Where: $X \in \{P, S\}$; $x, y \in \mathbb{Z}$, $1 \leq x \leq b_1$, $1 \leq y \leq b_2$.
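The binning in Equation 8 can be sketched in a few lines. The example assumes that each abstract already has a two-dimensional coordinate and a quartile label between 1 and 4, and that scientific and industrial points share one set of grid bounds so their grids are comparable. The helper names and the `bounds` parameter are hypothetical.

```python
def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to a bin index 0..bins-1 (hi lands in the last bin)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    return max(0, min(idx, bins - 1))


def weighted_grid(points, weights, bounds, bins=20):
    """Sum per-abstract weights into a bins x bins grid over the 2-D embedding."""
    x_lo, x_hi, y_lo, y_hi = bounds
    grid = [[0.0] * bins for _ in range(bins)]
    for (x, y), w in zip(points, weights):
        grid[bin_index(x, x_lo, x_hi, bins)][bin_index(y, y_lo, y_hi, bins)] += w
    return grid


# Assumed toy input: four abstracts with 2-D coordinates and quartile labels 1..4.
points = [(0.0, 0.0), (1.0, 1.0), (0.99, 0.99), (0.5, 0.5)]
weights = [1, 4, 2, 3]
grid = weighted_grid(points, weights, bounds=(0.0, 1.0, 0.0, 1.0))
```

Two nearby abstracts fall into the same corner cell and their weights add up, which is the behavior Equation 8 describes for all points in $D_{xy}$.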

To smooth the densities, we apply Gaussian Kernel Density Estimation to the grid (we experimentally set the bandwidth to 0.8). The density values are then normalized to range from 0 to 1, creating a pseudo-probabilistic distribution for the semantic contexts present in the abstracts for each MeSH term within each domain.

Next, we calculate the distance between the two probability distributions of the scientific and industrial contexts using the Total Variational Distance and the Hellinger distance. We then subtract the distances from 1 to measure the similarity between the two context distributions, effectively quantifying the degree of semantic overlap between the scientific and industrial abstracts for a given MeSH term. The aforementioned steps are repeated per topic for all possible pairs of $(y_p, y_s)$, resulting in a 12x12 matrix of similarity scores, which we refer to as $\underline{\underline{\Delta}}_{12 \times 12}(m)$.
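The two similarity scores can be illustrated as follows. The sketch assumes that each grid is rescaled to sum to 1 before the distances are taken, which is one reading of the pseudo-probabilistic normalization, and it uses the standard definitions of both distances, each bounded between 0 and 1. Function names and the toy grids are hypothetical.

```python
from math import sqrt


def to_distribution(grid):
    """Flatten a grid and rescale it to sum to 1 (assumed normalization)."""
    flat = [v for row in grid for v in row]
    total = sum(flat)
    if total == 0:
        raise ValueError("grid has no mass")
    return [v / total for v in flat]


def tvd_similarity(p, q):
    """1 minus the total variation distance between two distributions."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_similarity(p, q):
    """1 minus the Hellinger distance between two distributions."""
    h = sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))
    return 1.0 - h


# Assumed toy grids: science mass sits on the diagonal, industry mass is uniform.
science_dist = to_distribution([[2, 0], [0, 2]])
industry_dist = to_distribution([[1, 1], [1, 1]])
tvd_sim = tvd_similarity(science_dist, industry_dist)
hel_sim = hellinger_similarity(science_dist, industry_dist)
```

Both scores equal 1 for identical distributions and 0 for distributions with disjoint support. Computing one such score for every year pair fills the 12x12 similarity matrix.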

Measurement of the content association

For each topic, we computed a single measure that denoted whether the content of scientific abstracts was more likely to be leading the content of industrial abstracts than vice versa. We denote this measurement as the triangular ratio ($tr$) and formally define it in Equation 9.

$$\text{tr}(\underline{\underline{\Delta}}(m)) = \frac{1 + \sum_{\substack{i,j \in \{1,\dots,12\} \\ \mathbf{i<j}}} \underline{\underline{\Delta}}(m)[i,j]}{1 + \sum_{\substack{i,j \in \{1,\dots,12\} \\ \mathbf{i \geq j}}} \underline{\underline{\Delta}}(m)[i,j]} \tag{9}$$

Here, the numerator (upper triangular) compiles the sum of elements where scientific abstracts are posited to influence industrial counterparts at subsequent time points, while the denominator (lower triangular) aggregates the elements reflecting the opposite, namely industrial influence on science. The ratio $tr$ thereby provides a concise metric of the directional disparity in the distribution of content themes: a value greater than 1 suggests a trend where scientific advancements inform industrial activities ($tr > 1$, indicating industry lags science), and conversely, a value less than 1 points to industrial activities informing scientific advancements ($tr < 1$, indicating science lags industry). The visualization of these content associations is detailed in Figure 4.
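Equation 9 translates directly into code. The sketch below assumes the similarity matrix is stored as a list of rows, with the row index for the scientific year and the column index for the industrial year; the function name and the 3x3 toy matrix are hypothetical.

```python
def triangular_ratio(delta):
    """Equation 9: (1 + sum over i < j) / (1 + sum over i >= j) of a square matrix."""
    n = len(delta)
    upper = sum(delta[i][j] for i in range(n) for j in range(n) if i < j)
    lower = sum(delta[i][j] for i in range(n) for j in range(n) if i >= j)
    return (1.0 + upper) / (1.0 + lower)


# Assumed toy matrix: similarities are higher when the industrial year is later.
delta = [
    [0.5, 0.9, 0.8],
    [0.1, 0.5, 0.7],
    [0.2, 0.1, 0.5],
]
tr = triangular_ratio(delta)

# A matrix with the same similarity everywhere, for comparison.
uniform_tr = triangular_ratio([[1.0] * 12 for _ in range(12)])
```

As Equation 9 is written, the diagonal ($i = j$) is counted in the denominator, so a matrix with identical similarity in every cell yields a ratio slightly below 1 rather than exactly 1.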

Figure 5: The percentage of CCAUC ratios exceeding one (see "Measurement of trend association") (top) and significant p-values (bottom) for both frequency (see "Representing Data as Paper and SBIR Award Frequency") and impact (see "Representing Data as Paper and Funding Impact") representations of the scientific and industrial data at multiple depths of MeSH terms. The x-axis denotes the topic resolution within the MeSH hierarchy. The error bar represents the deviation of this ratio as we shrink the window size ($\mathfrak{w}$).

Assessment of topic hierarchy on content association

We adopted a similar strategy to the frequency and impact analysis for the context analysis, using the $tr$ ratio, rather than the CCAUC, to assess semantic congruence as we traversed the depth of the MeSH tree. Furthermore, we leveraged the sliding window parameter $\mathfrak{w}$ (refer to Section "Assessment of topic hierarchy on trend association") to systematically select subsets of the similarity matrix $\underline{\underline{\Delta}}_{(12-\mathfrak{w}) \times (12-\mathfrak{w})}$. This selection methodology facilitated the tracking of the evolution of the $tr$ ratio distribution according to different values of $\mathfrak{w}$. As such, the calculated error bounds reflect the standard deviation of the corresponding $tr$ ratios at each depth level.

Figure 6: The mean MCC lag (see "Assessment of topic hierarchy on trend association") for the impact representation of the scientific and industrial data, decomposed by the topics at the first level of the MeSH hierarchical tree. The MCC lag signifies the time delay at which science and industry trends are most correlated. The lengths of the bars correspond to the average MCC lag for each term and its child terms. MCC lags between 0 and 11 indicate that science impacts industry the most at that delay, while lags ranging from 0 to -11 indicate the reverse influence. The colors of the bars represent the proportion of greater-than-one CCAUC ratios among a topic and its child topics. A higher proportion (cooler colors) implies a science-to-industry influence for the majority of the child terms, while a lower proportion (warmer colors) suggests an industry-to-science impact.

Interdisciplinary Research and Innovation

We also investigated the research questions in Section "Research Questions" among scientific studies and small businesses focusing on interdisciplinary topics. As scientific and industrial abstracts are labeled with sets of MeSH topics, we shifted our methodology from identifying these abstracts as instances of individual MeSH topics to labeling them based on pairs of these topics. Given the vast number of possible pairs, we applied the Pareto principle and selected the top 20% most frequent pairs, amounting to approximately 35k pairs. Subsequently, we carried out frequency, impact, and context analyses as outlined in Sections "Methods for Research Question 1" and "Methods for Research Question 2".
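The pair selection can be sketched with standard-library counters. The example assumes each abstract carries a set of MeSH labels and that "top 20%" means the most frequent fifth of all distinct unordered pairs; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_pairs(label_sets, fraction=0.2):
    """Count unordered MeSH-term pairs and keep the most frequent `fraction` of them."""
    counts = Counter()
    for labels in label_sets:
        # Sorting makes (A, B) and (B, A) count as the same pair.
        counts.update(combinations(sorted(set(labels)), 2))
    keep = max(1, int(len(counts) * fraction))
    return counts.most_common(keep)


# Assumed toy input: the MeSH labels attached to five abstracts.
abstracts = [
    {"Neoplasms", "Genomics", "Mice"},
    {"Neoplasms", "Genomics"},
    {"Genomics", "Mice"},
    {"Neoplasms", "Immunotherapy"},
    {"Neoplasms", "Genomics", "Immunotherapy"},
]
top = top_pairs(abstracts)
```

Each retained pair can then be treated as a topic in its own right, with abstracts assigned to it whenever both labels are present.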

Results

Results for Research Question 1

In Figure 5, we present the results of trend analyses for both the frequency (see "Representing Data as Paper and SBIR Award Frequency") and impact (see "Representing Data as Paper and Funding Impact") representations of scientific and industrial data across various MeSH term depths, as illustrated in the top subfigure. The frequency trend analysis showed a stable percentage of CCAUC ratios greater than one, with approximately 88% of topics at the primary depth (d=1). This percentage decreased slightly to 85% as we delved deeper into the MeSH layers. While our analysis spanned all 13 MeSH levels, the figure visually represents up to the sixth depth for clarity, given that the percentage differences in levels 7 to 13 were minimal (±2%). This analysis revealed a consistently strong influence of contemporary scientific activities on future industrial projects across different MeSH topic depths. This notable observation not only illustrates the inherent characteristic of frequency modeling but also supports the presumption that most scientific advancements will eventually find a commercial application among small businesses. This is especially pertinent since we selected the set of MeSH topics that had at least one industrial project granted based on that topic. This implies a broad and general interest from industry in scientific outcomes, which is reflected across different research depth layers. The results for the impact trend, on the other hand, indicated a percentage that ranged from 76% to 74% across various depths, demonstrating a slightly decreasing yet stable influence as we navigated deeper into the MeSH hierarchy. Similarly to the frequency trend, the impact trend remained consistent across different depth levels, suggesting a steady influence of science on industry, which holds even with the increasing volume of scientific outputs. 
Crucially, this influence indicates that industrial funding is not merely driven by the quantity of scientific production. Instead, it underscores the industry’s appreciation for the applicability and potential innovation stemming from impactful scientific findings.

The GC test, depicted in the bottom subfigure, further elucidated the influence of scientific advancements on industrial activities. For the frequency trend analysis, about 80% of MeSH topics across different depths consistently exhibited p-values less than 0.05, underscoring the significant predictive power of contemporary scientific activities on future industrial projects. In contrast, the impact trend analysis displayed a moderate range from 78% to 73% for significant p-values across the MeSH hierarchy. This suggests that while there’s a predominant influence of impactful scientific activities on the industrial domain, the intensity of this influence experiences a slight tapering as we traverse deeper into the MeSH layers.

In Figure 6, we reveal the direction and time lag between scientific and industrial trends for the top level of the MeSH hierarchy. The analysis implies that the latency of science’s impact on industrial funding varies significantly by topic. In particular, domains like "Anatomy" and "Chemicals and Drugs" exhibit a notably extended latency in the influence of scientific discoveries on industry, as compared to fields such as "Information Science", where the data suggests a reciprocal influence with industrial trends potentially shaping scientific research. These findings highlight the complex and nuanced interplay between scientific inquiry and industrial application across different biomedical disciplines.

Figure 7: The percentage of triangular ($tr$) ratios exceeding one (see "Measurement of the content association") for contextual text embedding (see "Representing the data as text embeddings") representations of the scientific and industrial data at multiple depths of MeSH terms. The x-axis denotes the topic resolution within the MeSH hierarchy, while the y-axis indicates the proportion of $tr$ ratios in that topic set exceeding one. The error bar represents the deviation of this ratio as we shrink the window size ($\mathfrak{w}$).

Results for Research Question 2

In Figure 7, we present our analysis of the temporal associations between the semantic content of scientific papers and SBIR abstracts. We utilized UMAP and the Hellinger distance to reduce the dimensionality of text representations and to assess the similarity between pairs of semantics-based probability distributions. Our results underscore a notable degree of science-to-industry influence: approximately 88% of the $tr$ ratios exceed one at the first depth, and this share trends downward as we delve deeper into the MeSH tree, dropping to 70%. Although not depicted in the figure, it is important to mention that both the Total Variational Distance (TVD) and Hellinger distance (HD) for PCA, in addition to the TVD for UMAP, begin from the same 88% benchmark. As depth increases, these metrics settle within a range from 70% to 75%, with a variation of about +5%. Depths 7 to 13 showed results consistent with the fifth depth, deviating by only ±1% across all measures. This downward pattern potentially stems from the increasing specificity of the scientific abstracts deeper into the tree, resulting in less semantic overlap with the generally broader industrial abstracts. This analysis underscores the persistent and nuanced semantic impact of current, impactful scientific advancements on future industrial funding.

Results for Interdisciplinary Studies

The incorporation of interdisciplinary studies into our analysis yielded compelling findings. Using frequency, impact, and contextual analyses, we identified a substantial science-to-industry influence. This was evidenced by CCAUC ratios greater than one in 82.5% of topics for frequency analysis and 76% for impact analysis, along with a 70% greater-than-one tr𝑡𝑟tritalic_t italic_r ratio for the contextual analysis. Although depth-wise investigation is not applicable due to the diverse origins of high-frequency terms, the robust correlations underscore the substantial influence of interdisciplinary scientific advancements in shaping industry.

Discussion

Key Findings

Science as a leading indicator for industrial innovation funding:

Our analyses reveal that topics with the most influential scientific activities (i.e., those with more citations) are also the most likely to see future allocations of industrial funds. More specifically, for up to 76% of topics investigated, the scientific interest in papers from those topics was associated with the total funding allocated to small businesses working on those topics in the future. This result provides evidence that science informs industrial innovation funding decisions.

Science as a leading indicator for industrial innovation topics:

Our analysis reveals that the semantic contents of scientific abstracts within a topic are associated with the future semantic contents of grant applications of small businesses working on those topics. More specifically, for approximately 75.6% of topics examined, text embeddings of scientific abstracts were associated with future industrial text embeddings. These findings provide evidence that science influences the direction of industrial innovation activities within topics.

Impact of Science on Industry

The primary objective of our analysis was to investigate how current scientific advancements impact future industrial innovation. The frequency, impact, and context analyses provided multifaceted insights into the dynamics of this process, revealing how scientific progress influences the allocation of industrial funding across different thematic depths and contexts. The frequency analysis demonstrated that various scientific activities significantly inform future industrial projects across varying depths of MeSH topic categorization. This pattern indicates a broad, general interest from industry in scientific outcomes, with a substantial portion of scientific discourse finding its way into commercial application. Meanwhile, the impact analysis refined this understanding by modeling the evolution of science and industry through the level of interest each work attracted in its respective field, showing substantial influence in guiding industrial investments and directions, regardless of the sheer volume of scientific production. This balance between quantity and quality elucidates a nuanced interplay between scientific research and industrial innovation, reaffirming the pivotal role that impactful science plays in shaping industrial progress. The context analysis, in turn, accentuated the profound semantic influence of scientific discourse on industrial innovation. However, the deeper we went into the MeSH tree, the more specific and technical language we encountered in scientific abstracts, whereas SBIR abstracts tend to maintain a more general vocabulary that often does not reflect these highly specific terms. This semantic divergence leads to a decrease in the percentage of terms with a $tr$ ratio above one, indicating a lesser degree of semantic overlap as we move to more specific terms. 
Consequently, our study portrays the significant influence of science on industry, emphasizing that this relationship extends beyond the mere volume of scientific output and is greatly influenced by its impact, the broader themes it advances, and the meaningful narratives it presents, which collectively underline the central role of science in steering industrial innovation.

Limitations

Our study offers insights into the science-industry relationship but has some limitations. The ontology used, while extensive, only represents a specific scientific domain; a broader ontology could yield deeper insights. Expanding the study’s timespan might refine error bounds through advanced statistical methods. While our data source is robust, larger databases might offer further insights. This study centers on analyzing science-industry associations, but future work could predict upcoming trends in this interplay. Nonetheless, our findings set a solid groundwork for subsequent research.

Conclusion

This study aimed to model the interaction between scientific research and industrial innovation using different techniques, including frequency, impact, and context analysis. Our results consistently underscored scientific advancements’ decisive role in shaping industrial innovation. Influential scientific activities substantially align with future industrial funding, and the thematic content of scientific discourse profoundly affects industrial innovation. These findings illuminate the nuanced interplay between science and industry, which is dictated by not only the quantity of scientific output but also its relevance and impact. Future research can capitalize on our findings while also addressing the limitations outlined herein. Such work would contribute to a greater understanding of the intricate dynamics between scientific exploration and industrial innovation.

Acknowledgements

This research was funded in part by the Faculty Research Awards of J.P. Morgan AI Research. The authors are solely responsible for the contents of the paper and the opinions expressed in this publication do not reflect those of the funding agencies.

Disclaimer This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

  • [1] Luo, J., Wu, M., Gopukumar, D. & Zhao, Y. Big data application in biomedical research and health care: A literature review. \JournalTitleBiomedical Informatics Insights 8, 1, DOI: 10.4137/BII.S31559 (2016).
  • [2] Jinek, M. et al. A programmable dual-rna-guided dna endonuclease in adaptive bacterial immunity. \JournalTitleScience (New York, N.Y.) 337, 816–21, DOI: 10.1126/science.1225829 (2012).
  • [3] El Mounadi, K., Morales-Floriano, M. & Garcia-Ruiz, H. Principles, applications, and biosafety of plant genome editing using crispr-cas9. \JournalTitleFrontiers in Plant Science 11, DOI: 10.3389/fpls.2020.00056 (2020).
  • [4] Shammas, M. A. Telomeres, lifestyle, cancer, and aging. \JournalTitleCurrent opinion in clinical nutrition and metabolic care 14, 28–34, DOI: 10.1097/MCO.0b013e32834121b1 (2011).
  • [5] for Science, N. C. & Statistics, E. Federal budget authority for r&d and r&d plant for national defense and civilian functions totaled $191 billion in fy 2023 proposed budget. https://ncses.nsf.gov/pubs/nsf23323 (2023). Accessed on January 26, 2023.
  • [6] Jürgens, B. & Herrero-Solana, V. Patent bibliometrics and its use for technology watch. \JournalTitleJournal of Intelligence Studies in Business 7, 17–26, DOI: 10.37380/jisib.v7i2.236 (2017).
  • [7] Skute, I., Zalewska-Kurek, K., Hatak, I. & Weerd-Nederhof, P. Mapping the field: A bibliometric analysis of the literature on university–industry collaborations. \JournalTitleThe Journal of Technology Transfer 44, 916–947, DOI: 10.1007/s10961-017-9637-1 (2019).
  • [8] Magadán Díaz, M. & García, J. Publishing industry: A bibliometric analysis of the scientific production indexed in scopus. \JournalTitlePublishing Research Quarterly 38, DOI: 10.1007/s12109-022-09911-3 (2022).
  • [9] Cobo, M., Jürgens, B., Herrero-Solana, V., Martínez, M. & Herrera-Viedma, E. Industry 4.0: a perspective based on bibliometric analysis. \JournalTitleProcedia Computer Science 139, 364–371, DOI: https://doi.org/10.1016/j.procs.2018.10.278 (2018). 6th International Conference on Information Technology and Quantitative Management.
  • [10] Krestel, R., Chikkamath, R., Hewel, C. & Risch, J. A survey on deep learning for patent analysis. \JournalTitleWorld Patent Information 65, 102035, DOI: https://doi.org/10.1016/j.wpi.2021.102035 (2021).
  • [11] Zhu, Y., Wang, Y., Zhou, B., Hu, X. & Xie, Y. A patent bibliometric analysis of carbon capture, utilization, and storage (ccus) technology. \JournalTitleSustainability 15, DOI: 10.3390/su15043484 (2023).
  • [12] Puliga, G., Bono, F., Gutierrez Tenreiro, E. G. & Strozzi, F. Bibliometric analysis of scientific publications and patents on smart cities. Tech. Rep. JRC129102, Publications Office of the European Union (2023). DOI: 10.2760/074691.
  • [13] Wang, L. & Li, Z. Knowledge flows from public science to industrial technologies. \JournalTitleThe Journal of Technology Transfer 1–24, DOI: 10.1007/s10961-019-09738-9 (2021).
  • [14] Chakraborty, M., Byshkin, M. & Crestani, F. A. Patent citation network analysis: A perspective from descriptive statistics and ergms. \JournalTitlePLoS ONE 15, DOI: https://doi.org/10.1371/journal.pone.0241797 (2020).
  • [15] Tan, W., Jing, L., Wang, Y. & Li, W. A global bibliometric analysis on kawasaki disease research over the last 5 years (2017-2021). \JournalTitleFrontiers in Public Health 10, 1075659, DOI: 10.3389/fpubh.2022.1075659 (2022).
  • [16] Farooq, K., Ur Rehman, S., Ashiq, M., Siddique, N. & Ahmad Phd, S. Bibliometric analysis of coronavirus disease (covid-19) literature published in web of science 2019-2020. \JournalTitleJournal of Family and Community Medicine 28, 1–7, DOI: 10.4103/jfcm.JFCM_332_20 (2021).
  • [17] Mardikoraem, M. & Woldring, D. Protein fitness prediction is impacted by the interplay of language models, ensemble learning, and sampling methods. \JournalTitlePharmaceutics 15, DOI: 10.3390/pharmaceutics15051337 (2023).
  • [18] Maghrabi, Y., Ashgar, M., Aljohani, S., Alqarni, R. & Baeesa, S. Three decades of spine surgery research evolution in saudi arabia: A bibliometric analysis. \JournalTitleJournal of Spine Practice (JSP) 2, 51–60, DOI: 10.18502/jsp.v2i2.12627 (2023).
  • [19] Rubini, S., Chandrasekar, K., Janen, T. & Sriskandarajah, N. Water quality in northern province of sri lanka: A bibliometric analysis of publications 1960–2021. \JournalTitleWorld Water Policy n/a, DOI: https://doi.org/10.1002/wwp2.12117 (2023).
  • [20] Mohana Murali, S., Senthamarai Kannan, K. & Samuel, M. Bibliometric analysis of the scientific literature on human papillomavirus vaccine clinical trials: Analysis of pubmed database. \JournalTitleNational Journal of Community Medicine 14, 424–32, DOI: 10.55489/njcm.140720232951 (2023).
  • [21] Audretsch, D. B., Link, A. & van Hasselt, M. Knowledge begets knowledge: university knowledge spillovers and the output of scientific papers from u.s. small business innovation research (sbir) projects. \JournalTitleScientometrics 121, 1367 – 1383 (2019).
  • [22] Hayter, C. & Link, A. From discovery to commercialization: accretive intellectual property strategies among small, knowledge-based firms. \JournalTitleSmall Business Economics DOI: 10.1007/s11187-021-00446-z (2021).
  • [23] Pubmed. Internet (2022). [Accessed: 2022-12-01].
  • [24] SBIR awards data (2022). [Accessed: 2022-12-01].
  • [25] Neumann, M., King, D., Beltagy, I. & Ammar, W. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, DOI: 10.18653/v1/w19-5034 (Association for Computational Linguistics, 2019).
  • [26] Bodenreider, O. The unified medical language system (umls): Integrating biomedical terminology. Nucleic acids research 32, D267–70, DOI: 10.1093/nar/gkh061 (2004).
  • [27] Abbasi, A., Miahi, E. & Mirroshandel, S. A. Effect of deep transfer and multi-task learning on sperm abnormality detection. Computers in Biology and Medicine 128, 104121, DOI: 10.1016/j.compbiomed.2020.104121 (2021).
  • [28] Wang, L. et al. Text embeddings by weakly-supervised contrastive pre-training. arXiv:2212.03533 (2022).
  • [29] McInnes, L., Healy, J., Saul, N. & Großberger, L. Umap: Uniform manifold approximation and projection. Journal of Open Source Software 3, 861, DOI: 10.21105/joss.00861 (2018).
  • [30] van der Maaten, L. & Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research 9, 2579–2605 (2008).

Author contributions

Conceptualization, methodology, investigation, modeling, validation, and manuscript writing were performed by R.K. and M.M.G. Review and scientific editing were performed by S.K., C.H.S., T.A., I.B., A.N., M.M.G., and R.K. Data processing was executed by R.K., M.M.G., and T.K. Project administration: M.M.G. All authors reviewed the manuscript.

Competing Interests

The authors declare no competing interests.

Data Availability

All data used in this research, including datasets from PubMed, SBIR, and MeSH, are publicly available. The implementation, including code and relevant files, can be found at the project’s GitHub repository (https://github.com/HAAIL/science-impacts-industry).