Panchromatic Mining For Quasars: An NVO Keystone Science Application
\[ P_i = \frac{1}{S} \, \frac{R_i}{1 - R_i} \prod_{j=1}^{N} (1 - R_j) \]
where $P_i$ is the probability of an identification for the i-th source, and $S$ is a normalization factor that guarantees the total
probability calculated for a specific object (including the probability of no cross-catalog identification) will be unity:
\[ S = \sum_{i=1}^{N} \frac{R_i}{1 - R_i} \prod_{j=1}^{N} (1 - R_j) + \prod_{j=1}^{N} (1 - R_j) \]
where N is the number of possible counterparts (i.e., the number of sources within the specified spatial proximity of the target
source). In this manner, we can determine the specific probability of association for all sources in our catalog based on
different astronomical insights. While not explicitly mentioned in this discussion, we can also use spectral classifications and
photometric redshift indicators both to reduce the number of likely optical and near-infrared counterparts (e.g., by eliminating
optical sources that are known to be X-ray quiet) and to improve the reliability of the association probabilities beyond what
simple flux ratios provide (i.e., through the use of statistical distance indicators).
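The normalization described above can be illustrated with a short sketch. It assumes that each candidate counterpart carries a reliability R_i between 0 and 1 (as defined earlier in the text), and that the unnormalized weight for candidate i is R_i / (1 - R_i) times the product of (1 - R_j) over all candidates. The function name and the example reliabilities are hypothetical.

```python
from math import prod


def association_probabilities(reliabilities):
    """Return ([P_1, ..., P_N], P_none) for candidate reliabilities R_i.

    Assumes each R_i satisfies 0 <= R_i < 1 (an assumption of this sketch).
    """
    if any(not 0.0 <= r < 1.0 for r in reliabilities):
        raise ValueError("each reliability must satisfy 0 <= R < 1")
    # Unnormalized probability that none of the N candidates is the counterpart.
    none_term = prod(1.0 - r for r in reliabilities)
    # Unnormalized probability that candidate i is the counterpart
    # and every other candidate is not.
    weights = [r / (1.0 - r) * none_term for r in reliabilities]
    # Normalization factor S: all mutually exclusive outcomes sum to unity.
    s = sum(weights) + none_term
    return [w / s for w in weights], none_term / s


# Three hypothetical candidates within the search radius.
p_id, p_none = association_probabilities([0.8, 0.3, 0.05])
```

By construction, the returned identification probabilities and the no-identification probability sum to one for any number of candidates, including zero.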
4.1.3. Spectroscopic Classification and Photometric Redshifts
With the advent of the Hubble Deep Field (Williams et al. 1997), determining redshifts for objects from broadband
photometry underwent a renaissance. Two complementary approaches have gained popularity: spectral template fitting and
linear regression. The first technique (e.g., Mazin and Brunner 2000, and references therein) uses model and observed
spectral energy distributions to generate a redshift-magnitude grid. The observed magnitudes for a galaxy are then used to
determine which spectral energy distribution from the grid of model values provides the best fit. The second approach (e.g.,
Brunner 1997, Brunner et al. 1999) utilizes the inherent clustering of the galaxy distribution in multidimensional flux space
(due to the discrete nature of allowed SEDs) to empirically estimate redshifts. Both of these approaches have been shown to
reliably estimate redshifts, to an accuracy in z of about 0.06 for z < 1.0 (see, e.g., Brunner et al. 1997) and about 0.1 for
z < 5 (Hogg et al. 1998).
For our purposes, however, these techniques have shortcomings. The traditional template fitting technique can only do as
well as our knowledge of the spectral properties of the sources under consideration allows, a limitation known as template
incompleteness. On the other hand, the empirical technique requires a uniform set of training spectroscopic redshifts. Recently, however, a
hybrid technique has been developed, in which templates are empirically derived from the dataset under analysis (Budavari et
al. 2000). This approach is well suited to a federated optical, near-infrared dataset such as the one we are constructing (note
that we will have at least 6 different flux measurements for the majority of our sources), since we have an overwhelmingly
large number of galaxies with which to construct our eigen-spectral basis, which provides, simultaneously, both a redshift
estimate and a spectral classification.
Basically, this technique utilizes the fact that galaxy spectra can be represented by a small number of spectral energy
distributions (i.e., the principal components or eigenspectra):
\[ F = \sum_i a_i e_i \]
where $a_i$ is the i-th eigenvalue and $e_i$ is the i-th eigenspectrum (Connolly et al. 1999).
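As an illustration of this expansion, the following sketch derives an orthonormal set of eigenspectra from a sample of spectra with a singular value decomposition, then projects one spectrum onto that basis and rebuilds it. The synthetic sample, the function name, and the choice of an uncentered SVD are assumptions made for the example; this is not the procedure of Connolly et al. (1999).

```python
import numpy as np


def eigenspectra(spectra, n_components):
    """Leading eigenspectra (as rows) of an (n_objects, n_wavelengths) array."""
    # Uncentered SVD: the rows of vt are orthonormal and ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(spectra, full_matrices=False)
    return vt[:n_components]


# Synthetic stand-in for a spectral sample: every "spectrum" is a mixture
# of two underlying shapes, so two eigenspectra describe it exactly.
rng = np.random.default_rng(0)
wave = np.linspace(0.0, 1.0, 50)
shapes = np.vstack([np.sin(2.0 * np.pi * wave), wave**2])
spectra = rng.uniform(0.5, 2.0, size=(200, 2)) @ shapes

basis = eigenspectra(spectra, n_components=2)
a = basis @ spectra[0]        # coefficients a_i: projections onto each e_i
reconstructed = a @ basis     # F = sum_i a_i e_i
```

Because the basis rows are orthonormal, the coefficients follow from a simple projection. Real spectra would generally need more components and would be reproduced only approximately.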
With a minor modification to the standard template photometric redshift estimation technique, we can now use the eigenbasis
directly to estimate photometric redshifts. In particular, to determine a spectral classification for an astronomical source, we
need to optimally select a spectral energy distribution from a grid of redshifted template spectra and constant stellar spectra.
Algorithmically, we need to select the best fit to our data for different models, which can be easily done using chi-square
fitting.
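A minimal sketch of such a chi-square selection is given below. It assumes that the grid of redshifted templates has already been converted to model fluxes in each band, and that each model may be rescaled by a free amplitude. The function name, array layout, and numerical values are hypothetical.

```python
import numpy as np


def best_fit_template(flux, flux_err, model_grid):
    """Pick the model that minimizes chi-square against the observed fluxes.

    flux, flux_err : (n_bands,) observed fluxes and their 1-sigma errors
    model_grid     : (n_models, n_bands) model fluxes, one row per
                     template/redshift combination
    Returns (index of best model, amplitudes, chi-square values).
    """
    weight = 1.0 / flux_err**2
    # Best-fit amplitude for each model, from d(chi2)/d(amplitude) = 0.
    amp = (model_grid * flux * weight).sum(axis=1) / (model_grid**2 * weight).sum(axis=1)
    residual = flux - amp[:, None] * model_grid
    chi2 = (residual**2 * weight).sum(axis=1)
    return int(np.argmin(chi2)), amp, chi2


# Three hypothetical models in six bands; the "observation" is model 1
# scaled by 2.5, so model 1 should be recovered.
models = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
                   [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])
observed = 2.5 * models[1]
errors = np.full(6, 0.1)
best, amplitudes, chi2_values = best_fit_template(observed, errors, models)
```

The row index of the best model identifies both the template (the spectral classification) and the redshift at which it was placed.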
The appropriate eigenspectra templates can be derived either from existing spectral templates or model spectra, or they can be
iteratively derived from the galaxy photometry directly. The second approach is what we are using in this project, as it can
directly incorporate stars, galaxies, and quasars in a unified approach. Fundamentally, this technique treats the photometric
observations as a very low-resolution spectrograph. However, since we have a very large number of sources at different
redshifts, we have more than enough information to reconstruct the appropriate eigenspectral basis for our
sample. As a starting point, we will utilize the Coleman, Weedman, and Wu galaxy spectral templates (Coleman et al. 1980),
selected stellar templates from the Pickles library (Pickles et al. 1998), as well as synthetic quasar spectra (cf.
Hatziminaoglou et al. 2000 for a more thorough discussion of generating quasar spectral energy distributions and quasar
photometric redshifts).
Photometric redshift estimates can also utilize information from other wavelength regions, such as the radio to infrared flux
ratio (Helou et al. 1985). However, the most important caveat to keep in mind when using photometric redshifts is that they
should be used as statistical distance indicators. As a result, we define the probability density function, P(z), for an individual
galaxy's redshift to be a Gaussian probability distribution function with mean ($\mu$) given by the estimated photometric redshift
and standard deviation ($\sigma$) defined by the estimated error in the photometric redshift (SubbaRao et al. 1996):
\[ P(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(z - \mu)^2}{2\sigma^2} \right] \]
The estimated error in the photometric redshift can be determined either by measuring the intrinsic dispersion in a given
technique, or by generating bootstrap Monte-Carlo ensembles (e.g., Brunner et al. 1999). In this way, we can reliably
estimate the probability of a given galaxy being within a specific redshift range.
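For a Gaussian P(z), the probability of lying within a redshift interval is the difference of the cumulative distribution evaluated at the two limits. The sketch below assumes exactly that Gaussian form; the function name and the numerical values are illustrative only.

```python
from math import erf, sqrt


def prob_in_range(z_phot, sigma, z_lo, z_hi):
    """Probability that the true redshift lies in [z_lo, z_hi].

    Assumes a Gaussian P(z) with mean z_phot (the photometric redshift)
    and standard deviation sigma (its estimated error).
    """
    def cdf(z):
        # Cumulative distribution function of the Gaussian P(z).
        return 0.5 * (1.0 + erf((z - z_phot) / (sigma * sqrt(2.0))))

    return cdf(z_hi) - cdf(z_lo)


# Hypothetical galaxy with z_phot = 0.50 and an estimated error of 0.06:
# probability of lying within one standard deviation of the estimate.
p_one_sigma = prob_in_range(0.50, 0.06, 0.44, 0.56)
```

Summing such probabilities over all galaxies in a catalog gives the expected number of objects in a redshift slice without assigning any single galaxy a hard redshift.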
4.2. Knowledge Discovery
Now that the details of the data federation process have been explored, how do we do anything useful with all of this data? This
process is generally known as Knowledge Discovery in Databases (KDD). Fortunately for us, other groups are working on
these problems. For example, the computational astrostatistics group at Carnegie Mellon University has developed a cluster
finding code that first identifies clusters in multidimensional data, and then assigns a value to all data vectors based on their
distance from the clusters (Nichol 2001). This technique is extremely valuable in identifying and characterizing outliers in
color space for subsequent follow-up (e.g., Type-2 Quasars or High Redshift Quasars). Given an arbitrary parameter space,
we wish to determine the clusters that naturally occur in that space. We would also like to find any isolated
groups or holes in the clusters, as well as any isolated points or even small isolated clusters. This process of density
estimation is one example of the power of collaborations between computer scientists, statisticians, and astronomers. We
strongly encourage such partnerships as we feel they are vital to the successful implementation of virtual observatories. Other
areas that should prove ripe for collaboration include performance improvements in existing algorithms, supervised and
unsupervised classification, Bayesian Nets, genetic algorithms, and implementations that can utilize computational grids.
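The idea of first finding clusters and then scoring every data vector by its distance from them can be sketched in a few lines. The example below is not the CMU code: it substitutes a plain k-means (Lloyd) iteration with a deterministic farthest-point initialization, and the two-dimensional "color space", the cluster positions, and the outlier are all synthetic assumptions.

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iteration with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Next seed: the point farthest from all seeds chosen so far.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


def outlier_scores(points, centers):
    """Distance from each point to its nearest cluster center."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.min(axis=1)


# Two synthetic clusters in a two-color space plus one isolated source.
rng = np.random.default_rng(0)
cluster_a = rng.normal([0.0, 0.0], 0.2, size=(50, 2))
cluster_b = rng.normal([5.0, 5.0], 0.2, size=(50, 2))
points = np.vstack([cluster_a, cluster_b, [[0.0, 4.0]]])

centers, labels = kmeans(points, k=2)
scores = outlier_scores(points, centers)
candidate = int(np.argmax(scores))  # index of the most isolated source
```

A simple distance to the nearest center ignores cluster shape and density, so a production service would more plausibly rely on a proper density estimate; the sketch only shows where such a score would plug into the workflow.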
In Figure 4, the general topology of our planned
framework is displayed, showing the connection
between the data federation facilities and the
astrostatistics services (such as the CMU density
estimator). One promising mechanism for
connecting the data facilities with the statistical
algorithms is to employ web services. In this
approach, the implemented algorithms are
wrapped as SOAP servers, and are deployed
within the framework.
Another potential model is to employ Enterprise
JavaBeans to wrap the algorithms. This model
could easily be deployed within a JavaSpace to
take advantage of distributed compute services
resulting in a dynamic computational grid. As a
result, we are closely monitoring the JXTA
project from Sun Microsystems, which has
developed a peer-to-peer library in Java.
5. Conclusions
Currently the Skyserver facility consists of a single Microsoft SQL Server 2000 installation with the data of interest
completely co-located. Clearly this is not an optimal solution, and it goes against our stated design goals. As a result, one of our
next tasks is to construct a mirror of the existing Skyserver installation, which will be located at Microsoft's Bay Area
research campus under the direction of Jim Gray. Currently the data holdings include the SDSS EDR, the 2MASS second
incremental release, the NVSS survey, the FIRST survey, the ROSAT survey, and the IRAS survey. In addition, all data
federation is currently performed using spatial proximities (via HTM). This is not satisfactory, for the
many reasons mentioned earlier in these proceedings, so probabilistic associations will be implemented, initially as stored
procedures within SQL Server, although other options are also being explored.
While we currently offer only limited community access, primarily because we lack the resources, we do plan to
eventually open up the entire facility, possibly within the context of the NASA/IPAC Extragalactic Database (NED).
This proceeding describes the Skyserver project, which is a Panchromatic Quasar Data Mining project. The science use cases
for the Skyserver project form several key science drivers for the creation of Virtual Observatories. More information can be
found at the project's webpage (www.skyserver.org). While important, this project focuses on only one small part of the
problem arising from the data tsunami we are facing. Other projects are either on the drawing board or underway. One
example of a successful project that is working on the visualization aspects of a virtual observatory is virtualsky.org, which
provides novel visualization of massive image archives. The combination of these different projects results in a whole that is
greater than the sum of its parts, and provides enormous gains towards the ultimate construction of virtual observatories.
[Figure 4 diagram. Box labels: Cross-ID Toolkit; Astrostatistics Web Services: Cluster Finding; Infrared Data; Optical Data; Data Federation.]
Figure 4: A diagram showing the topology of our planned
framework for federating disparate data and exploring the
resulting parameter spaces.
ACKNOWLEDGEMENTS
We are grateful to all of our collaborators from around the world who share our vision, in particular the other members of the
Digital Sky project. Particular thanks are given to George Djorgovski, Tom Prince, and Alex Szalay who have inspired and
assisted with much of what is described in this proceeding, and to Jim Gray for both financial and moral support. In addition,
we wish to thank NASA and NSF for their encouragement in difficult times, and both Sun Microsystems and Microsoft
Research for their support. Finally, we would like to explicitly acknowledge financial support from NASA grants NAG5-10885
and NAG5-9482.
REFERENCES
1. Brunner, R.J., 1997, Ph.D. Thesis, The Johns Hopkins University.
2. Brunner, R.J., Connolly, A.J., and Szalay, A.S., 1999, ApJ, 516, 563.
3. Budavari, T., et al. 2000, AJ, 120, 1588.
4. Coleman, G.D., Wu, C.-C., & Weedman, D.W. 1980, ApJS, 43, 393.
5. Connolly, A.J. et al. 1999, Photometric Redshifts and High-Redshift Galaxies, ASP Conference Series number 191. 13.
6. Djorgovski, S., et al., 1998, New Horizons from Multi-Wavelength Sky Surveys, IAU Symposium No. 179, 424.
7. Hatziminaoglou, E., et al. 2000, A&A, 359, 9.
8. Helou, G., Soifer, B.T., and Rowan-Robinson, M, 1985, ApJ, 298, L7.
9. Hogg, D. et al. 1998, ApJ, 499, 555.
10. Lonsdale, C.J, et al., 1998, New Horizons from Multi-Wavelength Sky Surveys, IAU Symposium No. 179, 450.
11. Madau, P. in Cosmic Star Formation Near and Far, 481.
12. Mazin, B., and Brunner, R.J., 2000, AJ, 120, 2721.
13. Mazzarella, J., et al. 1999, B.A.A.S., 194, 8305.
14. Nichol, R.C. 2001, private communication.
15. NVO White Paper, 2001, http://www.astro.caltech.edu/nvoconf/white_paper.pdf.
16. Padovani, P., 1998, New Horizons from Multi-Wavelength Sky Surveys, IAU Symposium No. 179, 257.
17. Odewahn, S.C., et al. 1998, B.A.A.S., 193, 0201.
18. Osmer, P., 1998, New Horizons from Multi-Wavelength Sky Surveys, IAU Symposium No. 179, 249.
19. Pickles, A.J., 1998, PASP, 110, 863.
20. Rutledge, R., Brunner, R.J., Prince, T., & Lonsdale, C. 2000, ApJS, 131, 335.
21. Samet, H. 1990, The Design and Analysis of Spatial Data Structures, Addison Wesley.
22. Shanks, T., and Boyle, B.J., 1994, M.N.R.A.S, 217, 753.
23. SubbaRao, M.U., et al., 1996, AJ, 112, 929.
24. Szalay, A.S., & Brunner, R.J. 1999, Astronomical Archives of the Future: A Virtual Observatory, Future Generations of
Computer Systems, 16, 63.
25. White, N., et al., http://lheawww.gsfc.nasa.gov/users/white/wgacat/wgacat.html
26. White, R., et al. 2000, ApJS, 126, 133.
27. Williams, R.E., et al., 1996, AJ, 112, 1335.