ABSTRACT
Gaia is the next astrometric ESA mission, conceived to extend the Hipparcos legacy by producing what has been called the first stereoscopic census of the Galaxy. The spacecraft will be launched in the spring of 2012 and will measure astrometry with unprecedented accuracy for a significant 1% of the objects in the Milky Way. Additionally, two spectrophotometers will determine the spectral energy distributions (SEDs) of the objects in the 0.3–1 µm region, and a radial velocity spectrograph (RVS) will measure the kinematics of the brightest objects (down to 17 mag). The Gaia RVS will provide spectra in the near-IR Ca II triplet region with an expected signal-to-noise ratio (S/N) between 100 and 20 for FGK stars with visual magnitudes between 8 and 15. In order to deal with the enormous volume of data that the mission will generate, automated specialized analysis tools are being developed by the mission's scientific Data Analysis and Processing Consortium (Gaia DPAC). In particular, we have been testing several analysis techniques in order to be prepared to extract all possible astrophysical information from RVS stellar spectra. A combination of data processing in transformed domains (Fourier analysis and wavelet multilevel decomposition) and connectionist systems (artificial neural networks, ANNs) has proven to be a good approach to derive the fundamental stellar parameters, Teff, log g, [Fe/H], and [α/Fe], on the basis of RVS synthetic spectra degraded with noise at different S/N levels. Signal-processing techniques allowed us to estimate and categorize the S/N, which in turn is found to be essential since the optimal algorithm for parameterization is highly dependent on the S/N. In the case of low S/N (5–25) spectra, the wavelet transform provides a competitive approach for parameterization. The derivation of the stellar parameters is performed by means of ANNs trained with the error backpropagation algorithm.
The accuracy of the parameters' derivation is presented for typical Galaxy populations.
1. INTRODUCTION
The first studies aimed at developing a self-consistent model of the Milky Way's structure, formation, and evolution were based on star counts along different lines of sight. In a pioneering work, Bahcall & Soneira (1980) proposed a two-component model (disk and halo), and Gilmore & Reid (1983) found that the disk could be further divided into two parts, thin and thick, to obtain a better fit. Since then, a general consensus has been reached on the basic stellar components of the Galaxy, which are the thin disk, thick disk, stellar halo, and central bulge, although the relationships and distinctions among the different components remain a subject of debate (Schonrich & Binney 2009).
An approach based only on star counts has limitations, since the distribution of stellar apparent magnitudes depends on other factors such as the luminosity function (and its dependence on metallicity), the initial mass function, and the star formation rate. Despite these constraints, star counts have proven to be an efficient method to delineate the large-scale structure of the various stellar components of the Galaxy (Wyse 2006 and references therein).
The introduction of radial velocities, proper motions, and parallaxes, starting with samples in the solar neighborhood, enabled researchers to study the details of stellar composition and distribution in the different galactic structures. The development of the first massive spectroscopic surveys (RAVE, SDSS/SEGUE) allowed the determination of metal content (and noncanonical α-element abundances) and the study of the diverse morphological structures with more reliability than the mere use of photometric indices (Twarog et al. 2002; Allende et al. 2004).
The first works on stellar populations in the Galaxy in the complete space of parameters (distance, atmospheric parameters, proper motions) were undertaken in the first third of the twentieth century, applying photometric observations in stellar clusters, both galactic and globular. These studies could be extended to field stars after the Hipparcos astrometric measurements became available (Casertano et al. 1999; Sowell et al. 2007; Bacham et al. 2006).
Stellar physico-chemical parameterization represents a major step for the understanding and modeling of the Galaxy and its components. The study of the metal content and chemical abundances of wide samples of stars is providing evidence on the history of stellar formation and information on chemical enrichment from previous populations with unexpected detail (Bacham et al. 2006).
The Gaia ESA space mission was designed as a primarily astrometric mission that will extend the Hipparcos legacy by several orders of magnitude, both in astrometric precision and in the number of observed sources. But Gaia is a complex project, conceived to provide much more information about the Galaxy and its vicinity. It will carry two spectrophotometers that will measure the SEDs of the observed sources (between 330 and 1050 nm) and will allow the determination of their physical nature. It will also be equipped with the RVS, a radial velocity spectrograph designed to determine radial velocities and stellar parameters down to an approximate magnitude of 17 (Katz 2009), with a resolution R = 11,500 and an operative wavelength range around the near-IR Ca II triplet (847–874 nm). Our study focuses on the preparation of the automated analysis tools for the RVS survey, aimed at the development of automated algorithms for the parameterization of the fundamental properties of stars that will enable us to efficiently analyze the extensive volume of stellar spectra that the Gaia RVS instrument is expected to provide.
Over the past four years, our group has been involved in the Gaia scientific team as a member of Gaia DPAC Coordination Unit 8, "Astrophysical Parameters," which is responsible for classification tasks (Mignard et al. 2008). We tested different algorithms based on artificial intelligence (AI) techniques for the extraction of the physical parameters of stars by means of synthetic stellar spectra in the RVS spectral region, specifically calculated for the Gaia instrument (Recio-Blanco et al. 2006). Preliminary results (Ordóñez et al. 2010) showed that artificial neural networks (ANNs) applied in different data domains represent a very competitive and robust method for such a derivation. In the particular case of low S/N spectra, a combination of wavelet transforms and ANNs allows the derivation of the main atmospheric parameters with acceptable accuracy and statistical significance. We present and discuss both the methodology and the accuracies of the parameters obtained for several typical stellar populations of the Galaxy.
2. RVS SYNTHETIC SPECTRA AND TRANSFORMED DOMAINS
The RVS is a near-infrared ([847,874] nm), medium-resolution spectrograph, R = 11,500 (Katz 2009). Like the other Gaia instruments, the RVS will repeatedly scan the celestial sphere: over the course of 5 yr, the RVS will observe a source 40 times on average. From the measurements of unfiltered (white) light, Gaia will produce what are called G magnitudes (with G representing the very broad spectral response of Gaia, ≃330–1050 nm), while the spectral energy distribution of each source will be sampled by a dedicated spectrophotometric instrument providing integrated flux in the blue and red spectrophotometers' broad passbands, BP and RP. In addition, the RVS instrument will disperse the light in the range of 847–874 nm, for which it will include a dedicated filter G (RVS). RVS will be operated in windowed mode, as will the other Gaia instruments. The windows are 1104 pixels long by 10 pixels wide; their length includes pixels outside the filter bandpass, which are used to measure the background. The effective number of pixels is 971, and the dispersion per pixel is 0.26 Å, with 2 pixels per resolution element. Some binning will be performed for the fainter stars in order to reduce the read-out noise as well as the telemetry rate (Katz 2009).
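The on-chip binning mentioned above can be sketched as follows. This is an illustration only, not the actual Gaia read-out scheme: the binning factor and the flat test spectrum are assumptions made for demonstration.

```python
import numpy as np

def bin_spectrum(flux, factor=2):
    """Sum adjacent pixels in groups of `factor`, trimming any remainder.

    Binning before read-out means one read (and one dose of read-out noise)
    per bin instead of one per pixel, and proportionally less telemetry.
    """
    n = (len(flux) // factor) * factor
    return flux[:n].reshape(-1, factor).sum(axis=1)

flux = np.ones(971)          # a flat spectrum over the 971 effective pixels
binned = bin_spectrum(flux)  # 485 bins, each the sum of two pixels
```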
Gaia DPAC has been structured in several coordination units, with CU8, "Astrophysical Parameters," in charge of the classification and parameterization of astronomical sources (Mignard et al. 2008). CU8 is organized in several work packages with a hierarchical structure, the top-level analysis package being DSC (discrete source classifier; Smith et al. 2008). DSC classifies point-source objects (extended sources will not be observed by Gaia) by assigning a probability of membership to each of a number of predefined classes: single star, physical binary star, nonphysical multiple system (star/star, star/nonstar, or nonstar/nonstar), quasar, unresolved galaxy, solar system object, and unknown (UFO). No input catalog will be used, so the automated classification of detected sources is a key part of the processing pipeline. Our work focuses on the parameterization of single-star spectra, which, for the purposes of these tests, are assumed to have been previously selected from the bulk of all RVS Gaia observations.
The ensemble of RVS simulated data produced for the testing and performance assessment of the parameterization algorithms was calculated by A. Recio-Blanco and P. de Laverny from the Nice Observatory (Recio-Blanco et al. 2006), using MARCS one-dimensional, plane-parallel, and spherical LTE model atmospheres (Gustafsson et al. 2008). The set of atmospheric parameter values considered in the model computation is presented in Table 1. For this grid of model atmospheres, synthetic spectra were computed in the range 8470–8740 Å with a step of 0.02 Å, then convolved and resampled to match the RVS spectral resolution. The synthetic spectra were calculated with the same geometry and abundances as the model atmospheres. The elements considered to be α elements are O, Ne, Mg, Si, S, Ar, Ca, and Ti.
Two ensembles of spectra were computed: a "nominal" ensemble that included all the possible parameter combinations, and a "random" set calculated by randomized interpolation between nominal atmospheric models. Noise was taken into account through a simple model of additive white noise at different S/N levels (S/N: 5, 10, 25, 50, 75, 100, 150, 200, 10,000, and clean).
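A minimal sketch of this additive white-noise model is given below. We assume S/N is defined as the ratio of the mean signal level to the noise standard deviation; the exact convention used in the RVS simulations may differ.

```python
import numpy as np

def add_white_noise(flux, snr, seed=None):
    """Degrade a spectrum with additive Gaussian white noise at a given S/N.

    Assumed convention: sigma_noise = mean(flux) / snr.
    """
    rng = np.random.default_rng(seed)
    sigma = np.mean(flux) / snr
    return flux + rng.normal(0.0, sigma, size=flux.shape)

clean = np.ones(971)                      # stand-in for a normalized spectrum
noisy = add_white_noise(clean, snr=10, seed=0)
```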
The derivation of the stellar parameters is performed by means of ANNs trained with the error backpropagation algorithm. A preprocessing stage, prior to the proper process of parameterization of the stellar spectra, was included in order to refine algorithm performance, especially for the case of low S/N. Furthermore, the fact that the spectral features that are sensitive to the different parameters (Teff, log g, metallic lines, molecular bands, and α-element lines) are broad or narrow, together with the consideration that they are located at specific wavelengths along the spectrum, suggested that a Fourier transform and/or a multilevel wavelet decomposition could provide good results for a selective filtering of the information.
The data domains that were considered for parameterization with ANNs were the following:
- 1.The spectrum in wavelength: in this case, the preprocessing consists only of normalizing and scaling the spectra to the [0,1] interval, so that they can be used as input signals for the ANNs.
- 2.The result of transforming the spectra into "frequencies" by applying the FFT (Cooley & Tukey 1965). The standard procedure was followed, using a Hamming window to reduce spectral leakage.
- 3.The application of the wavelet transform (Meyer 1989) to the spectra by means of a Mallat decomposition of the signals (Mallat 1989) into several orders of "approximations" and "details"; the former are signal components with high scales and low frequencies, the latter components with low scales and high frequencies. The concept of multilevel analysis, illustrated in Figures 1 and 2, refers to the repeated application of filtering to each successive signal approximation, obtaining a new level after each filtering stage. We opted for the Daubechies mother wavelet (Daubechies 1988), which is widely used in digital filtering.
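The Fourier-domain preprocessing of item 2 can be sketched as follows. The [0,1] scaling and the one-sided amplitude spectrum are assumptions about the implementation; only the number of retained points follows the text.

```python
import numpy as np

def fft_features(flux, n_keep=335):
    """Fourier-domain input features for the ANN: window, transform, truncate."""
    f = (flux - flux.min()) / (flux.max() - flux.min())  # scale to [0, 1]
    windowed = f * np.hamming(len(f))                    # taper the edges
    amplitude = np.abs(np.fft.rfft(windowed))            # one-sided amplitude
    return amplitude[:n_keep]                            # keep 335 points

sample = np.random.default_rng(0).normal(1.0, 0.1, 971)  # mock spectrum
feats = fft_features(sample)
```

For a real-valued series of 971 points, `rfft` yields 486 non-redundant points, roughly half the original, of which 335 are kept as network inputs.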
The dimensions of the input series vary when transformed domains are considered. The RVS simulated spectra consist of series of 971 normalized fluxes; applying the FFT to these series yields new input samples with roughly half as many points, of which only the central 335 points were retained. In the case of the multiresolution analysis with wavelets, the number of signal points also decreases by a factor of approximately 0.5 as we descend the filtering sublevels, resulting in signals with the number of points indicated in Table 2. We decided to unfold the analysis into five levels of approximations and details, since higher-order filtering produces signals with no meaningful information content. Figure 2 shows a sample spectrum and its wavelet decomposition up to the third level of approximations and details.
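The multilevel Mallat decomposition just described can be sketched as follows. For brevity we use the Haar filter pair; the study used a Daubechies mother wavelet, for which only the filter coefficients would change. Each level halves the number of points.

```python
import numpy as np

H_LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (approximation) filter
H_HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (detail) filter

def dwt_step(signal):
    """One filtering stage: split a signal into approximation and detail."""
    n = (len(signal) // 2) * 2               # trim odd-length signals
    pairs = signal[:n].reshape(-1, 2)
    return pairs @ H_LO, pairs @ H_HI        # (approximation, detail)

def multilevel(signal, levels=5):
    """Repeatedly filter the successive approximations (Mallat decomposition)."""
    approx, out = signal, []
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        out.append((approx, detail))
    return out

levels = multilevel(np.ones(971), levels=5)
# point counts per level: 485, 242, 121, 60, 30
```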
3. OPTIMIZING THE STELLAR PARAMETERIZATION AS A FUNCTION OF THE S/N
The algorithm used to perform the parameterization of the main astrophysical properties from stellar spectra in the RVS region is an ANN. In the field of computational astrophysics, and in particular in studies focused on the classification and derivation of stellar parameters from large spectroscopic or spectrophotometric surveys, ANNs have already proven their suitability. Most authors have used multilayer perceptrons (MLPs) with supervised learning and the error backpropagation training algorithm. Examples are the already classical works of von Hippel et al. (1994), Singh et al. (1998), Bailer-Jones et al. (1997, 1998), and Vieira & Ponz (1995), together with the more recent works by Gupta et al. (2004), Willemsen et al. (2005), Giridhar et al. (2006), and Bazarghan & Gupta (2007). A good overview of the errors that can be expected from the use of different techniques can be found in Allende et al. (2004). Even though in general terms the obtained results are acceptable within observational and model error margins, we believe that the challenge of automatically parameterizing and classifying an unprecedented quantity of sources of diverse natures and S/N, as is the case in the Gaia project, is better met by the combined use of different techniques in so-called "hybrid systems."
In other research fields, such as the processing of biomedical signals, electric distribution systems, the processing of digital images, etc., the integration of techniques is often the solution to specific problems. Our aim is to apply this methodology by integrating AI techniques (ANNs and knowledge-based systems) and classical preprocessing procedures such as principal components analysis (PCA), genetic algorithms, and transformed domains (FFT and wavelet decomposition). Some of these approaches have already been tested for the problem of automated classification of stellar spectra in the MK system (Rodríguez et al. 2004, 2008).
3.1. Stellar Parameterization by ANNs
Feed-forward networks with three layers (input, hidden, output) trained with the error backpropagation algorithm (Rumelhart et al. 1986) are well suited to the problem of spectral parameterization. A few authors have considered more sophisticated ANN algorithms and architectures, with no significant improvement in the reported network performance (Carricajo et al. 2004). Still, one has to bear in mind that backpropagation networks have well-known problems: a strong dependence on the initial training values (which makes it necessary to test several trainings), long training times, the "catastrophic interference" of new training patterns, weight saturation, temporal instability, and the "local minima" problem inherent to the use of gradient descent on the error surface, since they constitute a nonlinear approach to a problem in which the data are affected by noise.
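For illustration, a toy three-layer feed-forward network trained with error backpropagation (full-batch gradient descent on the squared error) is sketched below. It regresses a simple 1D function; the real networks take hundreds of spectral fluxes as inputs and are trained with XOANE, not with this code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 64).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X)                    # target function to fit

n_hidden, lr = 16, 0.3
W1 = rng.normal(0.0, 0.5, (1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 1)); b2 = np.zeros(1)

for _ in range(8000):
    h = np.tanh(X @ W1 + b1)                   # forward pass: hidden layer
    out = h @ W2 + b2                          # linear output layer
    g_out = 2.0 * (out - y) / len(X)           # gradient of the mean squared error
    g_W2 = h.T @ g_out                         # backpropagate to output weights
    g_h = (g_out @ W2.T) * (1.0 - h ** 2)      # through the tanh nonlinearity
    g_W1 = X.T @ g_h                           # down to the input weights
    W2 -= lr * g_W2; b2 -= lr * g_out.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_h.sum(0)

mse = float(np.mean((out - y) ** 2))           # small after training
```

The strong dependence on the random initialization mentioned above is visible here: changing the seed changes the final residual, which is why several trainings are launched in practice.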
We have handled these difficulties by using XOANE (Ordóñez et al. 2007), a self-developed tool that allows us to control the tuning of the network's generalization point and, consequently, the ANN weight configuration that provides the minimum fitting residuals. XOANE, the eXtensible Object Oriented Artificial Neural Networks Engine, allows us to easily shape the network architectures, training algorithms, and tests in Java, as required by the Gaia mission. Another advantage with respect to more popular tools (e.g., Matlab) is shorter execution times, an important consideration given the huge data volume involved in the mission calculations. The three-layer configuration (input, hidden, and output layers) is very frequently used for many purposes, not only parameterization problems, and has proven adequate for the present work.
Our experiments required a rack of four servers, each equipped with two Intel Xeon Quad Core processors and 16 GB of RAM. This hardware architecture allowed us to launch a total of 32 parallel trainings (eight per computer) without a significant impact on the equipment's productivity. The time needed to complete a training varies from about one day in the wavelength domain to several hours for the other data formats.
3.2. S/N Estimation
The reliability of a parameterization algorithm manifestly depends on the wavelength coverage, the spectral resolution, and the noise intensity (Bailer-Jones 2000). With the first two conditions fixed a priori by the instrument, the treatment of the noise and the configuration of the information extraction algorithm determine the extraction quality of the parameters. In this section we are interested in showing how the performance of an ANN-based algorithm highly depends on the S/N ratio of the RVS spectrum to be parameterized, and also in illustrating how in the current working hypothesis, i.e. synthetic spectra and additive Gaussian white noise, the S/N can be estimated.
In this phase of the project, while we await a definitive simulation of the RVS instrumental noise, Gaussian white noise was added to the RVS simulated spectra. Noise of this nature has a spectral density that is flat across all frequencies. Unlike the noise, the Fourier transform of any clean spectrum is band-limited and essentially vanishes at high frequencies. This fact has been used to categorize the noise and predict its intensity. When the instrumental noise is finally characterized by the Gaia development team, it will be further considered in the implementation of the parameter-derivation algorithms. Figure 3 shows the noise behavior at the highest frequencies: it represents the FFT amplitude as a function of frequency for three spectra with different S/N, with the last 180 points (about the last third) of the Fourier amplitude spectrum shown in the zoomed frame. If we calculate the mean integral value of the FFT amplitude for the spectra in the nominal ensemble at a given S/N, we find a distinct characteristic value for each noise level, which can be used to approximate the S/N of a particular spectrum. We decided to categorize the spectra into four noise classes: low (S/N = 5, 10, and 25), medium (S/N = 50, 75, and 100), high (S/N = 150, 200), and almost-clean (S/N = 10,000, infinite). This noise coding allows us to consider the pairing "input domain–S/N-specific algorithm" to optimize the parameterization.
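The estimator described above can be sketched as follows: the mean amplitude of the high-frequency end of the Fourier transform is essentially noise-driven, so noisier spectra show a larger value there. The thresholds separating the low/medium/high/almost-clean classes would be calibrated on the nominal ensemble; the spectra used below are illustrative stand-ins only.

```python
import numpy as np

def high_freq_level(flux, n_tail=180):
    """Mean amplitude over the last points of the one-sided Fourier spectrum.

    A clean spectrum is band-limited, so this value approaches zero; white
    noise contributes equally at all frequencies, raising it with 1/(S/N).
    """
    amplitude = np.abs(np.fft.rfft(flux))
    return float(np.mean(amplitude[-n_tail:]))

rng = np.random.default_rng(1)
smooth = np.sin(np.linspace(0.0, 4.0 * np.pi, 971))  # stand-in clean spectrum
noisy = smooth + rng.normal(0.0, 0.2, 971)           # same spectrum plus noise

hf_noisy = high_freq_level(noisy)    # noise-dominated: large
hf_smooth = high_freq_level(smooth)  # band-limited: near zero
```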
4. RVS SYNTHETIC SPECTRA PARAMETERIZATION RESULTS
4.1. Parameterization Internal Errors as a Function of S/N and Signal Domain
An extensive ensemble of experiments was carried out, training the networks with the nominal grid of synthetic spectra and testing with the random set of spectra (see § 2). The networks were trained with spectra in wavelength, in the FFT transformed domain, and with 10 wavelet-domain signals (5 approximation and 5 detail levels). We chose to empirically determine the data domain best suited for extracting each of the four stellar parameters. Better results were achieved when training and testing were performed on spectra in the same S/N range (see § 3.2). The results of the experiments are presented in Figures 4 and 5. While Figure 4 shows the mean absolute errors achieved in the derivation of Teff as a function of the ANN signal domain considered, Figure 5 shows the mean errors obtained in the derivation of the four atmospheric parameters as a function of the S/N.
We found that nearly clean spectra (S/N of 10,000 and cleaner) are better parameterized with networks trained in the FFT domain, whereas spectra with high and medium S/N (50–200) are better parameterized in the wavelength domain. In the case of very low S/N (5, 10, and 25), the best results are achieved with wavelet signals (approximation level A1), with an improvement in the parameterization of about 10% in log g and 6% in Teff with respect to the classical wavelength domain. For low S/N spectra, the chemical parameterization ([Fe/H] and [α/Fe]) shows similar errors in the wavelength and wavelet domains.
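The empirical "S/N range to best input domain" mapping just described can be written as a small dispatch function. The mapping follows the text; the exact band edges between the tested S/N values are an assumption.

```python
def best_domain(snr):
    """Select the ANN input domain empirically best suited to a given S/N."""
    if snr <= 25:
        return "wavelet_A1"   # first wavelet approximation for very low S/N
    if snr <= 200:
        return "wavelength"   # original spectrum for medium and high S/N
    return "fft"              # Fourier domain for nearly clean spectra

# e.g. best_domain(10) selects the wavelet approximation A1
```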
We would have expected the wavelet detail coefficients, which are sensitive to high-frequency features carrying information about metallicity and α-elements, to perform better, at least at lower noise levels. What we found, however, is that either the original spectral signal (wavelength) or, for noisy spectra, the first wavelet approximation (A1) is better suited to derive those stellar parameters.
Our sample of synthetic spectra covers disk and halo FGK stars, the spectral types for which the RVS design was optimized. We are interested in delimiting the stellar types and noise levels for which a statistically significant derivation of the main atmospheric parameters can be obtained. The statistical parameters (mean, bias, sigma, and upper quartile) of the internal error distributions obtained in the parameterization for each (best-suited domain, S/N) pair and for each of the four stellar parameters are presented in Tables 3 and 4. For most noise levels, the reported errors are meaningful when compared with the parameter steps of our synthetic grid (250 K in Teff, 0.5 dex in log g, 0.25 dex in [Fe/H], and 0.2 dex in [α/Fe]). It can be inferred that estimating the S/N range of the input spectra is a mandatory preliminary step, allowing us to select the input domain and the network that provide the best result for a given parameter.
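The summary statistics quoted for the internal error distributions can be computed as sketched below, applied to a hypothetical array of residuals (predicted minus true parameter value); the exact definitions used in Tables 3 and 4 are assumptions here.

```python
import numpy as np

def error_stats(residuals):
    """Summary statistics of an internal error distribution."""
    r = np.asarray(residuals, dtype=float)
    return {
        "mean_abs": float(np.mean(np.abs(r))),       # typical error size
        "bias": float(np.mean(r)),                   # systematic offset
        "sigma": float(np.std(r)),                   # dispersion
        "q75": float(np.percentile(np.abs(r), 75)),  # upper quartile
    }

stats = error_stats([-0.1, 0.0, 0.1, 0.2])  # e.g. [Fe/H] residuals in dex
```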
4.2. Parameterization of Galactic Populations
Once the S/N is estimated and the optimal data domain to train the ANNs is selected, we can study the performance of our algorithm for the derivation of the atmospheric parameters in different galactic populations. Dwarf or subdwarf low-mass stars can be considered representative tracers of the different galactic structures (Bochanski et al. 2007). They are distributed throughout the whole Galaxy, have lifetimes comparable to the age of the Galaxy itself, and their numbers at different metallicities reflect the structure and history of star formation in the Galaxy (Allende et al. 2006). Our sample of synthetic spectra is limited to FGK types, which will constitute the most numerous Gaia targets, and spans a variety of ages that can allow us to further derive the age–metallicity relation in the different galactic populations.
The measurement of stellar parameters in spectroscopic samples with numbers of stars high enough to be considered representative (Allende et al. 2006; Bochanski et al. 2007) has proven that it is possible to disentangle the disk and halo populations from the study of their metallicities. In order to further separate the thin and thick disk populations, their kinematical properties must be assessed, as both components show common values of the metal content; Gaia will determine precisely the three-dimensional kinematical properties of stars. Kinematics and chemical abundances complement our view of the Galaxy, allowing not only structural properties to be inferred but also clues about the formation and evolution of the Galaxy's components, in what has been named Galactic archeology. Gaia will allow us to better determine disk structural parameters such as the scale height and tilt, the distribution of iron abundance values, and possible galactocentric gradients in both the thin and thick disk components.
In order to demonstrate the performance of our algorithm for the derivation of the atmospheric parameters on a statistically significant sample of spectra representing the main three galactic populations, three samples from the grid of synthetic spectra were selected:
- 1.Metal-rich low-mass stars, representative of the thin disk component, with values of [Fe/H] in the [-0.75,+0.25] dex interval and no α enhancement.
- 2.Intermediate-metallicity low-mass stars tracing the thick disk, with [Fe/H] belonging to the [-1,-0.25] dex interval. The full range of [α/Fe] noncanonical abundances in the synthetic spectra was considered.
- 3.Highly metal-poor dwarfs and giants, representative of the halo, with [Fe/H] in the [-3,-1] dex interval, log g in the [1,5] dex interval, and all possible [α/Fe] values.
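The three population samples above could be selected from the grid of synthetic spectra as sketched below; the parameter arrays and their layout are assumptions made for illustration.

```python
import numpy as np

def select_population(feh, alpha_fe, logg, which):
    """Boolean mask selecting one of the three population samples."""
    if which == "thin_disk":    # metal-rich, no alpha enhancement
        return (feh >= -0.75) & (feh <= 0.25) & (alpha_fe == 0.0)
    if which == "thick_disk":   # intermediate metallicity, all [alpha/Fe]
        return (feh >= -1.0) & (feh <= -0.25)
    if which == "halo":         # metal-poor dwarfs and giants
        return (feh >= -3.0) & (feh <= -1.0) & (logg >= 1.0) & (logg <= 5.0)
    raise ValueError(which)

# hypothetical parameter columns for three grid spectra
feh = np.array([-2.5, -0.5, 0.0])
alpha = np.array([0.4, 0.2, 0.0])
logg = np.array([2.0, 4.5, 4.5])
halo_mask = select_population(feh, alpha, logg, "halo")
```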
Our purpose is to determine the accuracy of the parameters that can be obtained with our algorithm, for each of the stellar components. Again, we calculated the statistical parameters of the internal error distribution as a function of S/N and galactic population. The distribution of errors and main statistical parameters for the three populations, for three levels of S/N, and for the four atmospheric parameters are presented in Figures 6, 7, and 8. Statistically meaningful information can be obtained for both disk populations at S/N as low as 25, 10, and even 5. In the case of halo stars, high noise and the absence of significant lines prevent the derivation of the atmospheric parameters for S/N lower than 50.
5. CONCLUSIONS
This paper introduces the use of ANNs trained in transformed domains for the derivation of stellar atmospheric parameters from medium-resolution spectra in the near-IR Ca II triplet region. An extensive ensemble of tests has been performed on synthetic spectra calculated in the domain of the Gaia RVS instrument. The Fourier transform is found to be an excellent choice in the case of spectra with very high S/N, with the added advantage that the networks are trained with less than half of the points. Wavelet filtering also reduces the dimension of the signal to be processed and provides competitive results for very low S/N spectra (25 and lower). Our approach can provide good results in the derivation of physical information from medium- and high-resolution spectroscopy over a variety of S/N, provided that a sample of template spectra is available to train the networks.
One of the principal goals of the Gaia project is to increase our knowledge of the nature of the stars that make up the structural components of our Galaxy. Gaia will provide us with an enormous amount of data: astrometry and the SEDs of approximately 1% of the Galaxy's stars will be measured with unprecedented accuracy. For stars in the magnitude range 6–17, spectra in the near-IR Ca II region will be obtained, allowing the accurate determination of radial velocities and the main atmospheric parameters. Reliable algorithms for the automated analysis of this information have to be implemented in order to take maximum advantage of the data. While work is being done by the Gaia DPAC to characterize the RVS detector response, readout noise, and correction of overlapping frames, the tests presented here are useful to determine optimized algorithms for the automated derivation of the main stellar parameters in the instrument domain. Work in this direction is ongoing.
This work was supported by the Spanish Ministry of Education and Sciences, project numbers ESP-2006-13855-C02-02 and AYA2009-14648-C02-02, which are partially supported by FEDER funds. D. O. B. thanks the staff from the Nice Observatory for their hospitality during a work stay. We also thank Alejandra Recio-Blanco and Patrick de Laverny from the Nice Observatory for sharing their ensemble of RVS synthetic spectra.