Abstract
Superconductivity is a remarkable phenomenon in condensed matter physics, comprising a fascinating array of properties expected to revolutionize energy-related technologies and the pertinent fundamental research. However, the field still faces the challenge of achieving superconductivity at room temperature. In recent years, Artificial Intelligence (AI) approaches have emerged as promising tools for predicting properties such as the transition temperature (Tc), enabling the rapid screening of large databases to discover new superconducting materials. This study employs the SuperCon dataset, the largest dataset of superconducting materials. We perform various data pre-processing steps on it to derive the clean DataG dataset, containing 13,022 compounds. We then apply the CatBoost algorithm to predict the transition temperatures of novel superconducting materials. In addition, we developed a package called Jabir, which generates 322 atomic descriptors, and designed an innovative hybrid method, the Soraya package, to select the most critical features from the feature space. These yield R2 and RMSE values (0.952 and 6.45 K, respectively) superior to those previously reported in the literature. Finally, as a novel contribution to the field, a web application was designed for predicting and determining the Tc values of superconducting materials.
Introduction
The amazing properties of superconducting materials are a direct consequence of quantum mechanics emerging on a large scale1. The two basic characteristics that set superconductors apart from other materials are: (a) they offer no resistance to the flow of electric current, and (b) they completely expel magnetic fields2. No comprehensive theory capable of predicting the transition temperatures (Tc) of superconducting materials has yet been presented, and the discovery of new superconductors still relies on expert intuition, depending largely on experience-guided trial and error3. Hence, empirical laws have for many years served as guides for researchers in their efforts to fabricate new superconducting materials4.
Condensed matter physics strives to uncover the interactions of materials at the atomic level, since material properties derive from these interactions5. Predicting and determining the microscopic properties of materials presupposes solving the Schrödinger equation for a many-body system. However, solving this equation for such systems is practically impossible due to the vast Hilbert space needed to handle them, especially for highly correlated materials. Consequently, the solution adopted in most cases is to employ approximate methods6,7,8. One of these methods is Density Functional Theory (DFT), which is based on the Hohenberg-Kohn and Kohn-Sham theorems and has a substantial record of success in predicting material properties and solving the associated quantum mechanics problems9,10,11. Despite its outstanding achievements, the theory has some limitations in its current form; for instance, it employs approximations for the exchange-correlation functional, yields errors when applied to strongly correlated systems, can only be employed for a small number of atoms, and is hampered by computational costs and runtimes that grow with system size10,12,13,14,15. Strong electron-electron correlations in superconducting materials make it extremely challenging to perform first-principles calculations to determine their structural properties and predict their Tc3,9, making the search for novel alternative approaches inevitable.
As alternative strategies for solving quantum mechanics problems, machine learning methods offer lower computation costs, shorter execution times, accurate predictions, and faster development cycles9,12,13. Being data-driven, and given the huge amounts of data produced over the years, machine learning methods encourage researchers to use them for discovering novel materials and predicting their properties4,5,6,16. Materials science is nowadays said to have entered its fourth stage of evolution, termed "data-based materials science", a paradigm framing borrowed from Thomas Samuel Kuhn to describe the field's development6,12,17,18. Figure 1 illustrates the four (empirical, theoretical, computational, and data-driven) paradigms of materials science. To date, large amounts of theoretical and experimental data have been collected in the three traditional (i.e., empirical, theoretical, and computational) paradigms; the next logical step is to apply the innovative tools developed by artificial intelligence, which are capable of extracting knowledge from such data6,12,18,19,20,21,22.
Given the importance of the Tc values of superconducting materials, researchers have in recent years developed machine learning-based models for predicting this quantity. Selecting 21,263 superconducting materials and utilizing 80 atomic descriptors for each compound, Hamidieh4 used the XGBoost algorithm to design a model for predicting Tc. Stanev et al.16 employed the Random Forest algorithm to develop a model using 132 atomic features of Magpie descriptors for 6196 superconducting compounds. Konno et al.3 implemented a convolutional neural network (CNN) model (i.e., a deep learning model) to predict the Tc values of about 13,000 superconducting materials. They represented their materials using an innovative "periodic table reading" method. The dimensions of the representation were 4 × 32 × 7, with 4 representing the four orbitals s, p, d, and f corresponding to the valence electrons of each element in a compound, and 32 and 7 denoting the dimensions of the periodic table. Dan et al.23 developed the ConvGBDT model by merging the convolutional neural network (CNN) and gradient boosting decision tree (GBDT) models. For the three datasets DataS, DataH, and DataK, the authors used the Magpie descriptors to represent materials and the ConvGBDT model to predict Tc values. Li et al.11 introduced a hybrid neural network (HNN) model combining a convolutional neural network (CNN) with a long short-term memory neural network (LSTM). They utilized atomic vectors and employed both the one-hot and Magpie material characterization methods to represent superconductors in the feature space, finding that the Magpie features generally outperformed the one-hot features. Roter et al.24 employed the Bagged Tree method (a variant of the Random Forest algorithm) to design a model for predicting Tc. They represented superconducting materials using a chemical composition matrix as the feature space. The matrix had about 30,000 rows and 96 columns, wherein each row corresponded to a chemical formula and the columns contained the 96 primary elements of the periodic table; each entry was filled with an index corresponding to the elements of the given chemical compound. Quinn et al.25 utilized a Crystal Graph Convolutional Neural Network (CGCNN) model to integrate classification and regression models within a pipeline for identifying candidate high-temperature superconductors from among the 130,000 compounds in the Materials Project. In the crystal-graph representation of materials, the connections between atoms constitute the graph's edges, and the locations of the atoms and their properties constitute the vertices.
The main objective of the current research is to design a suitable and reliable model for predicting the Tc values of superconducting materials using machine learning approaches. While the algorithm and the dataset are the two indispensable research tools in data science, the present study attaches more importance to the dataset than to the algorithm. After carefully cleaning the data, we generate a suitable feature space for superconducting materials. The main advantages of the present work over previous ones are: (1) establishing a feature space more appropriately related to the superconducting Tc, and (2) identifying the features most related to the Tc values of superconducting materials. We reach significant results by designing the Jabir package to produce 322 atomic features for each compound and the Soraya package for selecting features.
Data
Data set
Two essential steps must be taken before statistical learning can predict Tc in superconducting materials. The first involves collecting and preprocessing a dataset, and the second is adopting a suitable algorithm for the learning process and model development on that dataset. According to Halevy et al.26, the first step is of greater significance, as data scientists typically devote about 80% of their effort to datasets and their preprocessing27; the same holds for the present work, which uses the SuperCon dataset (https://doi.org/10.48505/nims.3739), currently the largest and most comprehensive superconducting materials database, containing 33,407 superconducting compounds.
Here, a significant contribution is made by executing distinct data pre-processing steps and providing detailed explanations for each step. Ultimately, following the various data pre-processing phases and the exclusion of problematic data, the DataG dataset consisting of 13,022 superconducting compounds is derived.
Cleaning the dataset
Dealing with missing and duplicated data
The SuperCon dataset lacks transition temperature values for 7088 compounds. These cases are identified as missing data and removed from the dataset. Along with that, we remove 7418 duplicated entries. Among these, 1264 compounds are regarded as duplicates arising from permuted element order; examples include MgB2 and B2Mg, Ag7B1F4O8 and Ag7F4O8B1, Al0.1Si0.9V3 and V3Si0.9Al0.1, Zr2Co1 and Co1Zr2, etc.
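The paper does not give its de-duplication routine; below is a minimal sketch, assuming plain element+subscript formula strings, of how permuted formulas can be mapped to one canonical key (the function name and toy entries are illustrative only):

```python
import re
from collections import defaultdict

def canonical_formula(formula: str) -> str:
    """Sort (element, subscript) pairs alphabetically so permuted
    formulas such as 'MgB2' and 'B2Mg' map to the same key."""
    tokens = re.findall(r"([A-Z][a-z]?)([\d.]*)", formula)
    parts = sorted((el, sub or "1") for el, sub in tokens if el)
    return "".join(el + sub for el, sub in parts)

entries = ["MgB2", "B2Mg", "Ag7B1F4O8", "Ag7F4O8B1", "Nb3Sn"]
groups = defaultdict(list)
for f in entries:
    groups[canonical_formula(f)].append(f)

# Groups with more than one member are permutation duplicates.
print({k: v for k, v in groups.items() if len(v) > 1})
```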
Dealing with problematic data
(1) We eliminate 5348 compounds whose element subscripts are X, Y, Z, D, x, y, z, or d. (2) We remove problematic compounds such as HgSr2Ho0.333Ce0.667Cu2O6=z, Ba2Cu1.2Co2.4O2,4, Ag7Bf4O8, Hg0.3Pb0.7Sr1.75La0.25CuO4+2, Ho0.8Ca0.2Sr2Cu2.8P0.2Oz+0.8, and Bi1.6Pb0.4Sr2Ca2Cu3F0.8Oz-0.8. (3) Compounds containing elements not found in the periodic table are discarded. (4) Given the objective of predicting transition temperatures of superconductors at ambient pressure, compounds created under non-ambient pressures (e.g., La1H10, H2S1, H3S1, D3S1, etc.) are removed from the dataset. (5) The compound YBa2CuO6050 is eliminated on the grounds that the oxygen subscript of 6050 is likely incorrect4. (6) We dismiss 70 compounds whose transition temperatures are reported to be zero. (7) Finally, the compound Pb2CAg2O6 is discarded due to the unreasonable transition temperature of 323 K reported for it.
Data correction
(1) According to the SuperCon reference28, the transition temperature of the iron-based superconductor CsEuFe4As4 is nearly 30 K, while the SuperCon dataset records it as 287 K; it is therefore corrected to 28.7 K. Moreover, the compound Sm1Ba-1Cu3O6.94 is substituted with Sm1Ba1Cu3O6.94. (2) Bi1.6Pb0.4Sr2Cu3Ca2O1013 is altered to Bi1.6Pb0.4Sr2Cu3Ca2O10.13 because the nearby data rows contain formulas of the form O10.xx4.
Dealing with multiple temperatures reported for a single compound
One limitation of the SuperCon dataset is the presence of multiple Tc values reported for 2132 compounds, posing a challenge for accurate analysis. For instance, MgB2 alone has 47 different reported transition temperatures, ranging from 5 to 40.5 K. To tackle this challenge, we take the average transition temperature for compounds with multiple reported Tc values. Prior to averaging, it is essential to exclude compounds whose reported transition temperatures display significant dispersion. To achieve this, the standard deviation of the different transition temperatures for each compound is calculated, and compounds with standard deviations greater than 20 K are removed from the dataset. This procedure eliminates 18 compounds.
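A minimal pandas sketch of this filter-then-average step, with toy rows standing in for SuperCon entries (the column names are illustrative assumptions):

```python
import pandas as pd

# One row per reported measurement; the toy MgB2 rows are widely scattered,
# so their standard deviation (about 20.1 K) exceeds the 20 K cutoff.
df = pd.DataFrame({
    "formula": ["MgB2", "MgB2", "MgB2", "Nb3Sn", "Nb3Sn"],
    "tc":      [39.0,   40.5,   5.0,    18.1,    18.3],
})

stats = df.groupby("formula")["tc"].agg(["mean", "std"]).fillna({"std": 0.0})
kept = stats[stats["std"] <= 20.0]   # drop compounds with a Tc spread above 20 K
print(kept["mean"])                  # one averaged Tc per surviving compound
```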
Detecting outliers
Undoubtedly, outliers in a dataset can hinder the identification of underlying patterns, diminishing system performance and accuracy29. In this study, outliers are detected using the Z-score method30 and the PyOD package31, both renowned tools in the field of anomaly detection; a minimal sketch of such screening follows the list below. After a meticulous examination, the outliers are identified and excluded according to the three following distinct aspects:
(1) Transition temperature: The average transition temperatures of the remaining compounds range from 0.0005 to 250 K. Using the abovementioned techniques, 10 superconducting materials with average transition temperatures outside the 0.01 to 136 K range are identified as outliers and removed. Figure 2 illustrates the Tc distribution of a few superconducting material families.
(2) Number of elements: Figure 3 shows the number of compounds according to the number of constituent elements. Compounds with one, eight, or ten elements are identified as outliers and subsequently removed, eliminating 81 superconducting compounds.
(3) Summation of subscripts: Applying the abovementioned techniques reveals six compounds with subscript sums exceeding 100; these are subsequently removed as outliers.
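As a hedged illustration of the Z-score screen (the |z| > 3 cutoff and the synthetic data are assumptions for demonstration; the paper also uses PyOD31, whose detectors follow a similar fit-and-flag pattern):

```python
import numpy as np
from scipy import stats

# Synthetic average-Tc column: a bulk of ordinary values plus two extremes.
rng = np.random.default_rng(0)
tc = np.concatenate([rng.normal(20.0, 10.0, 500).clip(0.01), [250.0, 243.0]])

z = np.abs(stats.zscore(tc))  # standard score of each compound's average Tc
print(tc[z > 3])              # only the two extreme values are flagged
```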
Our meticulous data-cleaning procedures yield a refined dataset, called the DataG dataset, containing 13,022 compounds.
Computational methods
Machine learning algorithm
In this study, we use the CatBoost algorithm, a machine learning ensemble technique based on Gradient Boosted Decision Trees (GBDT) developed by the Yandex company. GBDT is an efficient tool for solving regression and classification problems on big datasets. CatBoost is a decision tree-based, open-source implementation for supervised machine learning that involves two innovations: Ordered Target Statistics and Ordered Boosting. Researchers have successfully employed CatBoost for machine learning investigations incorporating big data since its launch in late 2018. Numerous applications have been reported for CatBoost in various fields, including astronomy, finance, medicine, biology, electrical utilities fraud, meteorology, psychology, traffic engineering, cyber-security, biochemistry, and marketing32. However, the application of CatBoost has not previously been reported for predicting superconducting transition temperatures. This study uses the algorithm to determine whether it can efficiently identify relationships and patterns between the features and Tc. We show that, with atomic features created for superconducting materials, the CatBoost algorithm provides a model with very good accuracy.
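A minimal sketch of fitting a CatBoost regressor on a descriptor matrix is given below; the synthetic data and hyperparameters are illustrative assumptions, not the paper's settings (CatBoost's `boosting_type="Ordered"` switch enables the ordered boosting scheme mentioned above):

```python
import numpy as np
from catboost import CatBoostRegressor

# Synthetic stand-in for the descriptor matrix (rows: compounds, cols: features).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 30))
y = 10.0 * X[:, 0] + rng.normal(scale=2.0, size=1000)  # fake "Tc" target

model = CatBoostRegressor(
    boosting_type="Ordered",  # ordered boosting, one of CatBoost's two innovations
    iterations=500,
    learning_rate=0.05,
    depth=6,
    loss_function="RMSE",
    verbose=False,
)
model.fit(X, y)
print(model.predict(X[:5]))  # predicted Tc for the first five synthetic compounds
```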
Generating the feature space
After preprocessing the dataset, we must extract atomic features in a "data representation" procedure. There are two main approaches to representing compounds: the first is based on chemical formulas, and the second on crystal structure23. For our purposes, the atomic features of superconducting materials are generated using the first approach.
In fact, machine learning algorithms recognize a compound by its characteristics; that is, the identifiers and characteristics of a compound are the features assigned to it. This process is called data representation. Figure 4 shows how the atomic features are calculated.
We design and develop a Python package called Jabir to generate 322 atomic features for each compound of any type, including superconducting materials. The package calculates eight statistical quantities (e.g., variance and mean) for each physical feature (e.g., magnetic moment) based on the three components of Element, Subscript, and Fraction. Figure 5 depicts the feature-generation workflow of Jabir.
For illustration, consider the compound Mg0.9Fe0.1B2, composed of three elements. The subscripts are 0.9, 0.1, and 2, while the fractions of the elements in the compound, obtained from Eq. 1, are 0.3, 0.033, and 0.667, respectively.
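Equation 1 is not reproduced in this text; consistent with the worked numbers above, the fraction of element \(i\) follows from its subscript \(s_i\) as

$$f_i = \frac{s_i}{\sum_j s_j},$$

so for Mg0.9Fe0.1B2 the subscript total is \(0.9 + 0.1 + 2 = 3\), giving \(f_{\mathrm{Mg}} = 0.9/3 = 0.3\), \(f_{\mathrm{Fe}} = 0.1/3 \approx 0.033\), and \(f_{\mathrm{B}} = 2/3 \approx 0.667\).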
As mentioned, the atomic features are generated based on the three components of Element, Subscript, and Fraction. For the fraction-based atomic features, the elemental value is multiplied by the fraction of the element in the compound; similarly, for the subscript-based features, the elemental value is multiplied by the subscript of that element. The element-based atomic features, by contrast, rely solely on the elemental values; in other words, the elemental value is multiplied by one, ignoring the fraction and subscript. It should also be noted that the Jabir package calculates only the element-based atomic features for the four features Ionic Radius, van der Waals Radius, Period Number, and Group Number, because the subscript-based and fraction-based versions are meaningless for them. Table 1 briefly illustrates the calculation of the mean thermal conductivity of the Mg0.9Fe0.1B2 compound. To learn more about Jabir's features, see the Supplementary Information, where we briefly explain all 30 of the most significant features depicted in Fig. 7.
At first glance, the fraction-based and subscript-based features may seem identical, suggesting we should choose one. However, elements can have equal fractions in compounds made of the same elements while having different subscripts, so subscript-based features must also be included in the atomic feature space to account for this difference. For instance, the compounds Y1Fe2Si2 and Y2Fe3Si5 have identical fractions of Y (namely, 0.2), while it is rational to think that this same element has different effects in the two compounds. Among the three types of atomic features, only the subscript-based one captures this difference, which is why it must be used.
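The sketch below mirrors the Table 1 mean-thermal-conductivity example under the three weightings; the elemental values are rounded literature numbers and the function is an illustrative reconstruction, not the Jabir source:

```python
# Thermal conductivity of the pure elements in W/(m K), rounded literature values.
KAPPA = {"Mg": 156.0, "Fe": 80.0, "B": 27.0}

def mean_feature(composition, elemental, mode):
    """Mean of one atomic property under the three Jabir-style weightings."""
    total = sum(composition.values())
    values = []
    for el, sub in composition.items():
        if mode == "element":
            values.append(elemental[el])                 # raw elemental value
        elif mode == "subscript":
            values.append(elemental[el] * sub)           # weighted by subscript
        else:  # "fraction"
            values.append(elemental[el] * sub / total)   # weighted by fraction
    return sum(values) / len(values)

comp = {"Mg": 0.9, "Fe": 0.1, "B": 2.0}  # Mg0.9Fe0.1B2
for mode in ("element", "subscript", "fraction"):
    print(mode, round(mean_feature(comp, KAPPA, mode), 2))
```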
Feature selection
Feature selection methods are utilized to determine the best feature subset. Advantages of feature selection include reduced overfitting, improved accuracy, reduced training time, simplified model design, faster convergence, enhanced generalization, and improved robustness to noise33,34,35,36. Feature selection methods help pick out the subset of attributes most relevant to the Tc of superconducting materials. Generally, a feature space of N features has 2^N subsets; the current feature space with 322 features thus requires selecting a subset from among 2^322 ≈ 8.54 × 10^96 candidates. Due to this enormous number of subsets, sophisticated methods are needed to keep the computational cost manageable. There are four general approaches to selecting a feature subset: filter, wrapper, embedded, and hybrid33,35,37. In this study, various feature selection techniques are carefully tested and evaluated for their efficiency and effectiveness against evaluation criteria such as the coefficient of determination (R2, Eq. 2) and the root mean square error (RMSE, Eq. 3)23. Finally, we developed a novel and innovative hybrid method, published in the form of a Python package called Soraya.
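Equations 2 and 3 are not reproduced in this text; the standard definitions of these criteria, which we assume the paper follows, are

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where \(y_i\) is the measured Tc, \(\hat{y}_i\) the predicted Tc, \(\bar{y}\) the mean of the measured values, and \(n\) the number of test compounds.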
Using the proposed feature selection technique, 30 of the most important features generated by the Jabir package for DataG are selected. The results of these comparisons are reported in Table 2.
The proposed hybrid method for selecting the best subset of the feature space first calculates the pairwise correlations of all 322 features. In the second step, all features with absolute Pearson correlation values greater than 0.80 are grouped into distinct clusters. The Pearson correlation criterion is a parametric statistical method for determining the existence or absence of a linear relationship between two quantitative variables38. This step categorizes 304 features into 62 clusters and retains 18 features with correlations below 0.80 for the following steps. (Each cluster contains a varying number of features: one cluster may consist of only two correlated features, while another may comprise five. The grouping of 304 features into 62 clusters therefore reflects the characteristics of the features present in the dataset.) The aim of this step is to retain the most important feature from each cluster and eliminate the rest as redundant. Features with absolute correlations greater than 0.8 are considered very strongly correlated39. The objective of feature selection is to identify a subset of the original features of a dataset by eliminating irrelevant and redundant ones40; moreover, as the feature space dimensionality decreases, the accuracy of the learning model increases41. Of any set of very strongly correlated features, only one should be kept and the others removed, because redundant features provide no new information42.
In the third step, the learning process of the model is performed independently for each cluster, with the most significant feature in each cluster being selected and the others eliminated. Since the features were grouped into 62 distinct clusters, 62 features survive this step; once duplicates among them are deleted, 55 features remain.
In the fourth step, the 18 retained features are added to the 55 features obtained above. Subsequently, the SHAP (SHapley Additive exPlanations) method, which is based on game theory and explains the output of machine learning models43,44, is employed to sort the 73 features by significance. In this step, the SHAP method acts as a filter method.
In the fifth step, the 5 most significant features, as identified and sorted by the SHAP method in the previous step, are initially selected. Using forward selection (a wrapper method), the remaining 68 (i.e., 18 + 55 - 5) features are added one by one to these 5 until the 30 most significant features are selected. (These 30 features give the highest accuracy for the model; the Soraya package is designed to report the model accuracy as each feature is added.) The steps outlined above are depicted in Fig. 6.
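A condensed sketch of the five Soraya steps under stated assumptions (a 40-feature toy frame in place of the 322 DataG features, small CatBoost models, and a reduced target for speed); this is an illustrative reconstruction, not the Soraya source code:

```python
import numpy as np
import pandas as pd
import shap
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score

def split_by_correlation(X, thr=0.80):
    """Steps 1-2: greedily group features with |Pearson r| > thr."""
    corr = X.corr().abs()
    remaining = list(X.columns)
    clusters, singles = [], []
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [c for c in remaining if corr.loc[seed, c] > thr]
        for c in group[1:]:
            remaining.remove(c)
        (clusters if len(group) > 1 else singles).append(group)
    return clusters, [g[0] for g in singles]

def best_in_cluster(X, y, cluster):
    """Step 3: train on one cluster, keep its most important feature."""
    m = CatBoostRegressor(iterations=200, verbose=False).fit(X[cluster], y)
    return cluster[int(np.argmax(m.get_feature_importance()))]

def shap_rank(X, y):
    """Step 4: order the surviving features by mean |SHAP| value."""
    m = CatBoostRegressor(iterations=200, verbose=False).fit(X, y)
    sv = shap.TreeExplainer(m).shap_values(X)
    order = np.argsort(np.abs(sv).mean(axis=0))[::-1]
    return [X.columns[i] for i in order]

def forward_select(X, y, ranked, start=5, target=30):
    """Step 5: grow from the top-5 features, reporting accuracy per addition."""
    chosen = list(ranked[:start])
    for feat in ranked[start:]:
        if len(chosen) == target:
            break
        chosen.append(feat)
        m = CatBoostRegressor(iterations=200, verbose=False).fit(X[chosen], y)
        r2 = r2_score(y, m.predict(X[chosen]))  # training-set R2, for the sketch
        print(len(chosen), "features -> R2 =", round(r2, 3))
    return chosen

# Synthetic stand-in for DataG (40 features instead of 322, for speed).
rng = np.random.default_rng(0)
X_all = pd.DataFrame(rng.normal(size=(400, 40)), columns=[f"f{i}" for i in range(40)])
X_all["f1"] = X_all["f0"] * 0.95 + rng.normal(scale=0.1, size=400)  # a correlated pair
y = 5 * X_all["f0"] + rng.normal(size=400)

clusters, singles = split_by_correlation(X_all)
kept = sorted({best_in_cluster(X_all, y, c) for c in clusters})  # dedupe survivors
ranked = shap_rank(X_all[kept + singles], y)
final = forward_select(X_all, y, ranked, start=5, target=10)     # target=30 in the paper
```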
The DataG dataset, which contains 13,022 superconducting materials, comprises 83 elements of the periodic table. Accordingly, 83 columns are created in which the fractions of the constituent elements of each compound are recorded. This process is illustrated in Table 3. Finally, we add this feature vector to the previously selected features to obtain a final feature space with 113 (83 + 30) dimensions.
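A minimal sketch of building one row of this element-fraction matrix; the seven-element list stands in for the full 83 columns, and the parser assumes plain element+subscript strings:

```python
import re
import pandas as pd

def fraction_vector(formula, elements):
    """One row of the element-fraction matrix illustrated in Table 3."""
    tokens = re.findall(r"([A-Z][a-z]?)([\d.]*)", formula)
    subs = {el: float(sub or 1) for el, sub in tokens}
    total = sum(subs.values())
    return pd.Series({el: subs.get(el, 0.0) / total for el in elements})

elements = ["Mg", "Fe", "B", "Y", "Ba", "Cu", "O"]  # stand-in for the 83 elements
print(fraction_vector("Mg0.9Fe0.1B2", elements).round(3))
```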
Results
Identifying key features
During the feature selection process in Section "Feature selection", we employed the innovative hybrid technique of the Soraya package to pick the 30 most significant features. Subsequently, in Fig. 7, we sorted these selected features using the feature-importance capability of the CatBoost algorithm. The features depicted in Fig. 7 are explained in the Supplementary Information.
Across various studies4,38,45, including the current one, researchers have found that thermal conductivity stands out as the most important feature in determining the Tc of superconducting materials. Theoretically, the thermal conductivity of superconductors provides significant clues about the nature of their charge carriers, phonons, and the scattering processes occurring between them46. Thermal conductivity refers to the ability of a material to conduct heat; its significance is directly connected to the concentration of particles capable of transferring heat45,47. The concentration of superconducting particles (n_s) is related to a characteristic length describing the superconducting state, namely, the London penetration depth (λ), \(\lambda^{2}=\frac{m}{q^{2} n_{s} \mu_{0}}\), where m, q, and n_s are the mass, charge, and concentration of the superconducting particles, respectively, and μ0 is the magnetic constant38,48. The transition temperature of a superconductor is associated with both the London penetration depth and the coherence length; in other words, the formation and destruction of the superconducting state are related to these two lengths38,45,48. On the other hand, the results of this study and others4,38,45 show that the superconducting transition temperature correlates strongly with thermal conductivity; among the 322 features, the range of thermal conductivity has the strongest correlation (0.68) with Tc (see Fig. 7). It can thus be concluded that the results of this research are consistent with theoretical work.
Predicting the superconducting materials' Tc values
For the DataG dataset, which contains 13,022 superconducting materials, 90% of the data is allocated to the training set and 10% to the test set (i.e., 1303 compounds). Using the model created during training, the CatBoost algorithm predicts a Tc value for each of the 1303 compounds in the test set (Fig. 8). The resulting R2 and RMSE values are 0.952 and 6.45 K, respectively, superior to those reported in the literature. Table 4 compares the R2, RMSE, and MAE values obtained in the present study with those reported elsewhere; the model proposed here is superior on all three criteria.
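A hedged sketch of the 90/10 evaluation; the synthetic matrix stands in for DataG's 113-dimensional feature space, and the default CatBoost settings are assumptions:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic placeholder for DataG's 13,022 x 113 feature matrix and Tc vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(13022, 113))
y = 20.0 + 8.0 * X[:, 0] + rng.normal(scale=3.0, size=13022)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
model = CatBoostRegressor(verbose=False).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("R2   =", round(r2_score(y_te, pred), 3))
print("RMSE =", round(float(np.sqrt(mean_squared_error(y_te, pred))), 2), "K")
```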
Furthermore, the procedure employed for DataG (namely, creating new features, selecting features, tuning hyperparameters, etc.) is also applied to the DataS, DataK, and DataH datasets, leading to improved evaluation criteria (Table 4).
The model thus developed is subsequently used to predict the Tc values of SmFeAsO0.8F0.2, SmFeAsO0.7F0.3, and three new iron-based superconducting materials not included in the dataset and for which no Tc values have yet been reported. As shown in Table 5, the Tc value of the parent compound increases as the fluorine content increases toward an optimal doping level, in line with experimental results. Table 5 also indicates that one can vary the elements (e.g., substitution or changing contributions) to find trends for increasing the Tc of a specific compound, which can help materials scientists design high-temperature superconductors. It should be emphasized that none of the compounds in Table 5 are included in the SuperCon dataset.
Moreover, the model is used to predict the Tc values of a few superconducting compounds not included in DataG but for which Tc values have been previously reported in the literature. The results are provided in Table 6 for comparison. Clearly, good agreement is observed between the experimental Tc values and those obtained by the machine learning method used in this study.
Conclusion
In the realm of materials science, artificial intelligence stands as a powerful tool for predicting material properties. In this study, the CatBoost algorithm was employed to predict the Tc values of superconducting materials, marking a novel approach. For this purpose, data pre-processing of the SuperCon dataset was accomplished as a significant step in data science, yielding a new dataset called DataG that contains 13,022 superconducting compounds. In addition, the new Jabir package, capable of generating 322 atomic descriptors, was designed and developed. Comparisons revealed the superiority of the atomic features generated by Jabir over those generated by previous packages such as Magpie. Furthermore, an innovative hybrid technique was developed as the feature selection method (the Soraya package). In designing and developing the Jabir and Soraya packages, we applied novel ideas and innovative approaches, such as: (i) using new and diverse physical atomic features in the Jabir package and considering three different states (Element, Subscript, Fraction) when calculating the atomic features of each compound, and (ii) using an innovative hybrid technique in the Soraya package that removes features highly correlated with each other (redundant features), applies the SHAP technique to rank the most important features, and finally uses the forward method to add the most important features one by one. These contributions led to optimized evaluation values (R2, RMSE, MAE) for the DataH, DataS, and DataK datasets without the need for any further data pre-processing. The present study's results indicate that the procedure of selecting the most important descriptors significantly impacts the prediction of superconducting materials' Tc values. Finally, the development of a novel web application is a pioneering contribution to the field for predicting and determining the Tc of superconducting materials.
Data availability
The dataset (DataG), prepared after various data pre-processing steps on the SuperCon dataset, is available at: https://github.com/Gashmard/DataG_13022_superconducting_materials
Code availability
The developed packages (Jabir and Soraya) and the web application are accessible at the following URLs. Web application: https://supercon-tc.iut.ac.ir/
Jabir package: https://pypi.org/project/jabir/
Soraya package: https://pypi.org/project/soraya/
Jabir package on Github: https://github.com/Gashmard/jabir
Soraya package on Github: https://github.com/Gashmard/Soraya
References
Annett, J. F. Superconductivity, Superfluids and Condensates Vol. 5 (Oxford University Press, Oxford, 2004).
Hosono, H. et al. Recent advances in iron-based superconductors toward applications. Mater. Today 21(3), 278–302 (2018).
Konno, T. et al. Deep learning model for finding new superconductors. Phys. Rev. B 103(1), 014509 (2021).
Hamidieh, K. A data-driven statistical model for predicting the critical temperature of a superconductor. Comput. Mater. Sci. 154, 346–354 (2018).
Bedolla, E., Padierna, L. C. & Castaneda-Priego, R. Machine learning for condensed matter physics. J. Phys. Condens. Matter 33(5), 053001 (2020).
Schleder, G. R. et al. From DFT to machine learning: Recent approaches to materials science – A review. J. Phys. Mater. 2(3), 032001 (2019).
Hermann, J., Schätzle, Z. & Noé, F. Deep-neural-network solution of the electronic Schrödinger equation. Nat. Chem. 12(10), 891–897 (2020).
Njoku, I. et al. Approximate solutions of Schrodinger equation and thermodynamic properties with Hua potential. Results Phys. 24, 104208 (2021).
Stanev, V. et al. Artificial intelligence for search and discovery of quantum materials. Commun. Mater. 2(1), 105 (2021).
Bassani, F., Liedl, G. L. & Wyder, P. Encyclopedia of Condensed Matter Physics (2005).
Li, S. et al. Critical temperature prediction of superconductors based on atomic vectors and deep learning. Symmetry 12(2), 262 (2020).
Wei, J. et al. Machine learning in materials science. InfoMat 1(3), 338–358 (2019).
Rupp, M. Machine learning for quantum mechanics in a nutshell. Int. J. Quantum Chem. 115(16), 1058–1073 (2015).
Burke, K. Perspective on density functional theory. J. Chem. Phys. 136(15), 150901 (2012).
Frank, M., Drikakis, D. & Charissis, V. Machine-learning methods for computational science and engineering. Computation 8(1), 15 (2020).
Stanev, V. et al. Machine learning modeling of superconducting critical temperature. Npj Comput. Mater. 4(1), 29 (2018).
Kitchin, R. Big data, new epistemologies and paradigm shifts. Big Data Soc. 1, 1–12 (2014).
Himanen, L. et al. Data-driven materials science: Status, challenges, and perspectives. Adv. Sci. 6(21), 1900808 (2019).
Bishop, C. M. & Nasrabadi, N. M. Pattern Recognition and Machine Learning Vol. 4 (Springer, Berlin, 2006).
Lengauer, T. Statistical data analysis in the era of big data. Chem. Ing. Tech. 92(7), 831–841 (2020).
Gomez, C. et al. A contemporary approach to the MSE paradigm powered by artificial intelligence from a review focused on polymer matrix composites. Mech. Adv. Mater. Struct. 29(21), 3076–3096 (2022).
Li, Z. et al. Machine learning in concrete science: Applications, challenges, and best practices. Npj Comput. Mater. 8(1), 127 (2022).
Dan, Y. et al. Computational prediction of critical temperatures of superconductors based on convolutional gradient boosting decision trees. IEEE Access 8, 57868–57878 (2020).
Roter, B. & Dordevic, S. Predicting new superconductors and their critical temperatures using machine learning. Phys. C Supercond. Appl. 575, 1353689 (2020).
Quinn, M. R. & McQueen, T. M. Identifying new classes of high temperature superconductors with convolutional neural networks. Front. Electron. Mater. 2, 893797 (2022).
Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009).
Klettke, M. & Störl, U. Four generations in data engineering for data science: The past, presence and future of a field of science. Datenbank-Spektrum 22(1), 59–66 (2022).
Jackson, D. E. et al. Superconducting and magnetic phase diagram of RbEuFe4As4 and CsEuFe4As4 at high pressure. Phys. Rev. B 98(1), 014518 (2018).
Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly Media, Inc., 2022).
Chikodili, N. B. et al. Outlier detection in multivariate time series data using a fusion of K-medoid, standardized Euclidean distance and Z-score. In International Conference on Information and Communication Technology and Applications (Springer, 2021).
Zhao, Y., Nasrullah, Z. & Li, Z. PyOD: A Python toolbox for scalable outlier detection. arXiv:1901.01588 (2019).
Hancock, J. T. & Khoshgoftaar, T. M. CatBoost for big data: An interdisciplinary review. J. Big Data 7(1), 1–45 (2020).
Naheed, N. et al. Importance of features selection, attributes selection, challenges and future directions for medical imaging data: A review. CMES-Comput. Model. Eng. Sci. 125(1), 315–344 (2020).
Sánchez-Maroño, N., Alonso-Betanzos, A. & Tombilla-Sanromán, M. Filter methods for feature selection – A comparative study. In International Conference on Intelligent Data Engineering and Automated Learning (Springer, 2007).
Rosely, N. F. L. M., Salleh, R. & Zain, A. M. Overview feature selection using fish swarm algorithm. In Journal of Physics: Conference Series (IOP Publishing, 2019).
Bagherzadeh, F. et al. Comparative study on total nitrogen prediction in wastewater treatment plant and effect of various feature selection methods on machine learning algorithms performance. J. Water Process Eng. 41, 102033 (2021).
Jović, A., Brkić, K. & Bogunović, N. A review of feature selection methods with applications. In 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (IEEE, 2015).
Matasov, A. & Krasavina, V. Visualization of superconducting materials. SN Appl. Sci. 2, 1463 (2020).
Chen, P., Li, F. & Wu, C. Research on intrusion detection method based on Pearson correlation coefficient feature selection algorithm. J. Phys. Conf. Ser. 1757(1), 012054 (2021).
Xie, Z.-X., Hu, Q.-H. & Yu, D.-R. Improved feature selection algorithm based on SVM and correlation. In International Symposium on Neural Networks (Springer, 2006).
Khalid, S., Khalil, T. & Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Science and Information Conference (IEEE, 2014).
Toloși, L. & Lengauer, T. Classification with correlated features: Unreliability of feature ranking and solutions. Bioinformatics 27(14), 1986–1994 (2011).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30 (2017).
Rodríguez-Pérez, R. & Bajorath, J. Interpretation of machine learning models using Shapley values: Application to compound potency and multi-target activity predictions. J. Comput. Aided Mol. Des. 34, 1013–1026 (2020).
Matasov, A. & Krasavina, V. Prediction of critical temperature and new superconducting materials. SN Appl. Sci. 2(9), 1482 (2020).
Uher, C. Thermal conductivity of high-Tc superconductors. J. Supercond. 3, 337–389 (1990).
Maheshwary, P., Handa, C. & Nemade, K. A comprehensive study of effect of concentration, particle size and particle shape on thermal conductivity of titania/water based nanofluid. Appl. Therm. Eng. 119, 79–88 (2017).
Matasov, A. V. Characteristic lengths and plasmon superconductivity mechanism of some high-temperature superconductors. In International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE) (IEEE, 2019).
Zhigadlo, N. D. & Puzniak, R. Spin-glass-like behavior in SmFeAsO0.8F0.2. Mendeleev Commun. 32(3), 305–307 (2022).
Tamegai, T. et al. Bulk and local magnetic properties of iron-based oxypnictide superconductor SmFeAsO1−xFx. J. Phys. Soc. Jpn. 77(3), 54–57 (2008).
Hosono, H. et al. Exploration of new superconductors and functional materials, and fabrication of superconducting tapes and wires of iron pnictides. Sci. Technol. Adv. Mater. 16, 033503 (2015).
Owolabi, T. O., Akande, K. O. & Olatunji, S. O. Prediction of superconducting transition temperatures for Fe-based superconductors using support vector machine. Adv. Phys. Theor. Appl. 35, 12–26 (2014).
Zhang, Y. & Xu, X. Predicting doped Fe-based superconductor critical temperature from structural and topological parameters using machine learning. Int. J. Mater. Res. 112(1), 2–9 (2021).
Kudo, K. et al. Emergence of superconductivity at 45 K by lanthanum and phosphorus co-doping of CaFe2As2. Sci. Rep. 3(1), 1478 (2013).
Author information
Authors and Affiliations
Contributions
H.G. wrote the main manuscript text and prepared all figures, packages, and web application. H.Sh. supervised this study. All authors reviewed and edited the manuscript. All the authors discussed the results and commented on the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gashmard, H., Shakeripour, H. & Alaei, M. Predicting superconducting transition temperature through advanced machine learning and innovative feature engineering. Sci Rep 14, 3965 (2024). https://doi.org/10.1038/s41598-024-54440-y