XYLab: an interactive plotting tool for mixed multivariate data observation and interpretation

Ramazzotti, Matteo; Monsellier, Elodie; Degl’Innocenti, Donatella

Bioinformation, 2008

Bioinformation by Biomedical Informatics Publishing Group, open access, www.bioinformation.net
Prediction Model
ISSN 0973-2063, Bioinformation 2(9): 392-394 (2008)
Bioinformation, an open access forum © 2008 Biomedical Informatics Publishing Group

Matteo Ramazzotti 1, Elodie Monsellier 1 and Donatella Degl’Innocenti 1,*
1 Department of Biochemical Sciences, University of Florence, Italy; E-mail: matteo.ramazzotti@unifi.it; *Corresponding author
Received March 04, 2008; revised May 23, 2008; accepted May 27, 2008; published July 03, 2008

Abstract: The correct display of data is often a key point for interpreting the results of experimental procedures. Multivariate data sets suffer from the problem of representation, since a dimensionality above 3 is beyond the capability of plotting programs. Moreover, non-numerical variables such as protein annotations are usually fundamental for a full comprehension of biological data. Here we present a novel interactive XY plotter designed to give the user full control over large datasets containing mixed-type variables. It provides intuitive data management, a powerful labelling system and other features aimed at facilitating data interpretation and subsetting.

Availability: The XYLab program, test dataset and manual are available at www4.unifi.it/scibio/bioinfo/XYLab.html

Keywords: multivariate data; scatter plot; labels; search; subset

Background: A multivariate data set can be defined as a set of observations, for each of which a number of variables are recorded. Nowadays, life science researchers often deal with such data sets, generated by bioinformatics programs or by high-throughput applications such as microarray technology or mass spectrometry-based proteomics [1]. In many cases, numerical data are mixed with textual data, e.g. deriving from protein or gene databanks like those at EMBL or NCBI, just to cite the two major collections, or emerging from functional databases such as GO (Gene Ontology) [2] or KEGG pathway [3].

Numerical variables can be treated with multivariate statistics in order to reduce the dimensionality of the full sample dataset and to locate the most prominent trends. Clustering strategies are also very useful for their ability to group multivariate data into subcategories with homogeneous features [4]. For such purposes a number of packages are available, including generic mathematical packages such as R, MATLAB, SPSS or SYSTAT, and dedicated applications such as CLUSTER [5].

The representation of multidimensional datasets reaches its highest level of complexity with three-dimensional plots, which in many applications can be rotated and can show scattered points with colours or sizes proportional to other variables. In addition, labels containing additional data can usually be appended to points. The above-mentioned packages are provided with excellent data display capabilities, and a number of stand-alone applications can be used to represent multidimensional plots (e.g. Graphis, Voxler). Nevertheless, such graphical complexity is usually targeted at the finest representation of selected data and is not intended for routine interpretation tasks, when a multitude of variables is to be screened and evaluated. Besides, the generation and management of such graphs require time and a level of expertise that is not common among experimentalists.

For routine usage, two-dimensional (XY) scatter plots are the most used plot type and, despite their simplicity, they can reveal details that are not evident with other approaches. On the other hand, they split multivariate data into pair-wise spaces (for n variables there are n(n-1)/2 distinct pairs), thus increasing the number of plots and frequently making them hard to coordinate and, eventually, to re-unify.
Software: We propose a minimalist approach that addresses the problem of multidimensionality in an intuitive fashion. We developed a highly interactive, one-window plotting tool (XYLab, see Figure 1) that loads data from simple column-based tables to build an XY scatter plot with per-point pop-up labels. The plot area is controlled by three easily accessible selectors, named “X”, “Y” and “Lab”. Columns containing numerical variables are automatically detected and used to feed the first two selectors, while the third may also contain non-plottable variables such as text-based ones. XYLab aims to give the user easy, fast and full control of what to plot and which labels to show: a simple change of variable in any selector updates the plot with automatic rescaling and optimization. This allows the user to visually explore a number of data trends and interrelations in minutes.

Figure 1: A snapshot of the XYLab one-window interface. The program has three main sections: the interactive plot area at the top, where scatter plots are displayed; the copy area at the center, where information about copied points is displayed; and the control area at the bottom, containing the three plot controllers, the “Find” box and other elements that coordinate the plot interaction.

The presence of a dedicated “Lab” selector that acts in real time makes the labels readily tuneable and grants them an importance similar to that of the X and Y coordinates, instead of relegating them to a less accessible plot option as usually happens in other applications. To best exploit this label integration, XYLab implements a search-in-plot procedure: in practice, we introduced a text box that is read before plotting the points and that may contain a query directed against the variable selected in the “Lab” selector. Such a query can be textual, acting on text-based labels, or numerical (e.g. greater/less than), acting on numerical labels. All the positive matches are marked directly in the plot by changing the point appearance, without affecting their position in the Cartesian space. The deep integration with labels we implemented can be extremely useful when considering microarray or proteomic data; for example, it may allow the user to highlight the elements involved in the same metabolic pathway, searching for integrated expression profiles [6]. This approach was used in a recent work of our group to draw observations about the Codon Adaptation Index (CAI) of ribosomal proteins in different bacterial subdivisions [7].
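The column detection and search-in-plot behaviour described above can be illustrated with a minimal Python sketch. It is not the XYLab implementation (which is written in Perl/Tk) and makes several assumptions: the function names `numeric_columns` and `search_in_plot` are hypothetical, the operator syntax (`>`, `<`, `>=`, `<=`) is one possible reading of "greater/less than", text queries are treated as case-insensitive substring matches, and the three records are invented for illustration only.

```python
import operator
import re

# Assumed query syntax: an operator followed by a number, e.g. "> 0.5".
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}
_NUMERIC_QUERY = re.compile(r"^\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_columns(rows):
    """Names of the columns whose values are all numeric (candidates for X and Y)."""
    if not rows:
        return []
    return [name for name in rows[0] if all(is_number(r[name]) for r in rows)]


def search_in_plot(rows, lab, query):
    """Return one flag per row: True where the value in column `lab` matches.

    Only the highlight flag is computed; the X and Y values are never modified,
    so matching points keep their position in the plot.
    """
    match = _NUMERIC_QUERY.match(query)
    if match:
        compare, threshold = _OPS[match.group(1)], float(match.group(2))
        return [is_number(r[lab]) and compare(float(r[lab]), threshold) for r in rows]
    needle = query.strip().lower()
    if not needle:
        return [False] * len(rows)  # an empty query highlights nothing
    return [needle in r[lab].lower() for r in rows]


# Invented toy records, for illustration only (not data from the paper).
rows = [
    {"gene": "rpsA", "CAI": "0.81", "length": "557", "annotation": "30S ribosomal protein S1"},
    {"gene": "rplB", "CAI": "0.77", "length": "273", "annotation": "50S ribosomal protein L2"},
    {"gene": "lacZ", "CAI": "0.35", "length": "1024", "annotation": "beta-galactosidase"},
]

plottable = numeric_columns(rows)                          # feeds the X and Y selectors
text_hits = search_in_plot(rows, "annotation", "ribosomal")
numeric_hits = search_in_plot(rows, "CAI", "> 0.5")
```

In this sketch the result of a search is a list of flags parallel to the data rows, which a plotting layer could use to change the colour or shape of the matching points while leaving their coordinates untouched.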
Since XYLab is oriented to genome-based datasets, which usually contain thousands of elements, the per-point labels pop up only when the mouse rests over a point for a user-configurable delay time. In fact, the simultaneous display of, e.g., text-rich data on all points, as frequently implemented in other programs, usually makes the plot unreadable if the number of points is high. The classic all/none approach is nevertheless also available in XYLab, and it can take full advantage of the unlimited mouse-based zoom to analyse specific regions of the plot.

Another distinctive aspect of XYLab is a sub-setting mechanism that we called select-and-paste. Since the plot is an interactive area, the user can draw a rectangle around a region containing interesting points, and all the associated features are automatically displayed in a dedicated program area, ready to be exported. Thus, the plot itself guides the data selection and spares the user the tedious task of searching the full data table to trace back the desired information. This could be of great importance if points are clustered for reasons that are not obvious and that could depend on biological functions.

Other features of XYLab can be found on its dedicated page, where a detailed manual and a test multivariate dataset are also available (see above).
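The select-and-paste idea can be sketched in Python as a rectangle filter over the two plotted variables. This is only an illustration under stated assumptions, not XYLab's own code: the names `select_and_paste` and `as_tab_text` are hypothetical, the rectangle is given by two opposite corners in data coordinates, points on the border count as selected, and the records are invented toy data.

```python
def select_and_paste(rows, x, y, corner_a, corner_b):
    """Return the full records whose (x, y) point lies inside the rectangle.

    `corner_a` and `corner_b` are opposite corners in data coordinates; they may
    be given in any order, as when a rectangle is dragged in any direction.
    """
    (ax, ay), (bx, by) = corner_a, corner_b
    x_min, x_max = sorted((ax, bx))
    y_min, y_max = sorted((ay, by))
    return [
        r for r in rows
        if x_min <= float(r[x]) <= x_max and y_min <= float(r[y]) <= y_max
    ]


def as_tab_text(rows):
    """Format the selected records as tab-separated text with a header row."""
    if not rows:
        return ""
    header = list(rows[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(str(r[name]) for name in header) for r in rows]
    return "\n".join(lines)


# Invented toy records, for illustration only (not data from the paper).
rows = [
    {"gene": "rpsA", "CAI": "0.81", "length": "557", "annotation": "30S ribosomal protein S1"},
    {"gene": "rplB", "CAI": "0.77", "length": "273", "annotation": "50S ribosomal protein L2"},
    {"gene": "lacZ", "CAI": "0.35", "length": "1024", "annotation": "beta-galactosidase"},
]

# Rectangle dragged from the upper right to the lower left of the region of interest.
selected = select_and_paste(rows, "CAI", "length", (0.9, 600), (0.7, 200))
export_text = as_tab_text(selected)
```

The point of the design is that the selection returns every column of the chosen records, including the text-based ones that are not plotted, so the subset can be exported without going back to the full table.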
In conclusion, XYLab offers a simple and intuitive plotting interface aimed at the rapid interpretation of large multivariate datasets in which text and numbers have comparable importance.

Input: XYLab input consists of simple text files organized in tab- or comma-separated entries, with variable names in the first row. Any spreadsheet program or bioinformatics application can easily generate such files. In addition, properly formatted data can be pasted directly into XYLab from the computer clipboard for rapid visualization.

Output: XYLab exports the plots as vector images. The results of the search-in-plot and select-and-paste procedures can be saved as text files or copied to external applications. In addition, the program can save plot-based subsets of the full dataset.

Caveat and future development: The program is written in Perl (with the Tk graphics library), developed on MS Windows and tested on Debian Linux machines. In the future we plan to introduce curve-fitting and multivariate analysis modules in order to integrate mathematical data management with the current visualization efficiency.

References:
[01] D. M. Rocke, Semin Cell Dev Biol., 15: 703 (2004) [PMID: 15561590]
[02] M. Ashburner et al., Nat Genet., 25: 25 (2000) [PMID: 10802651]
[03] M. Kanehisa et al., Nucleic Acids Res., 30: 42 (2002) [PMID: 11752249]
[04] I. T. Joliffe and B. J. Morgan, Stat Methods Med Res., 1: 69 (1992) [PMID: 17233561]
[05] M. B. Eisen et al., PNAS, 95: 14863 (1998) [PMID: 9843981]
[06] F. Markowetz and O. G. Troyanskaya, Mol Biosyst., 3: 478 (2007) [PMID: 17579773]
[07] M. Ramazzotti et al., In Silico Biology, 7: 0035 (2007)

Edited by W. Cuff
Citation: Ramazzotti et al., Bioinformation 2(9): 392-394 (2008)
License statement: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.
