eMPRess
What is eMPRess
eMPRess is a software tool for reconciling pairs of phylogenetic trees such as host-parasite, host-symbiont, and species-gene trees under the Duplication-Transfer-Loss (DTL) model. The eMPRess tool was developed at Harvey Mudd College and is the successor to our Jane reconciliation tool. eMPRess has many features that are based on new and efficient algorithms. Read more about those features below.
eMPRess takes two undated binary phylogenetic trees (e.g., host and parasite, host and symbiont, species and gene) and an association of their tips as input. eMPRess addresses several important issues that are generally not addressed by other existing tools. Among them are:
Choosing the event costs is notoriously difficult. Different choices of the costs for duplication, transfer, and loss events can give rise to very different reconciliations and, consequently, very different conclusions. eMPRess helps guide the user in selecting event costs by computing and displaying "event cost regions" that show the different choices of event costs and their impacts on the resulting solutions. This feature allows users to systematically explore the space of event costs.
The number of maximum parsimony reconciliations (MPRs), even for a fixed set of event costs, can be extremely large (e.g., in the billions or more, even for trees with several tens of tips). A difficult problem, therefore, is selecting one or more MPRs that best represent the potentially huge solution space. eMPRess provides tools for visualizing the space of MPRs, clustering that space into "similar" MPRs, and finding a best representative MPR in each cluster.
eMPRess computes support values for each event in each reconciliation that it displays. The support value of an event is the fraction of MPRs that contain that event. eMPRess computes these support values exactly rather than by sampling.
eMPRess maintains and expands on many features in Jane, including both a graphical user interface and a command line interface, visualizations of reconciliations, and the ability to save these visualizations as high-quality images for use in publication.
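The support values mentioned in the feature list above can be illustrated on a toy example. The sketch below is only an illustration of the definition (the fraction of MPRs that contain an event): it enumerates a tiny, made-up solution space by brute force, which is not how eMPRess computes these values. The function name support_values and the event tuples are hypothetical.

```python
from fractions import Fraction


def support_values(mprs):
    """Map each event to the fraction of MPRs that contain it."""
    counts = {}
    for mpr in mprs:
        for event in mpr:
            counts[event] = counts.get(event, 0) + 1
    return {event: Fraction(n, len(mprs)) for event, n in counts.items()}


# Hypothetical solution space: each MPR is a set of
# (parasite_node, host_node, event_type) tuples.
toy_mprs = [
    {("p1", "h1", "cospeciation"), ("p2", "h2", "transfer")},
    {("p1", "h1", "cospeciation"), ("p2", "h3", "duplication")},
    {("p1", "h1", "cospeciation"), ("p2", "h2", "transfer")},
]

support = support_values(toy_mprs)
for event, value in sorted(support.items()):
    print(event, value)
```

In this toy space the cospeciation event appears in every MPR, so its support is 1, while the transfer appears in two of the three MPRs.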
eMPRess video - tutorial and use case
This 20-minute video tutorial provides a brief primer on the reconciliation problem and demonstrates the eMPRess workflow and functionality.
A touch of theory (recommended before getting started)
Maximum Parsimony Reconciliation
eMPRess, Jane, and most other reconciliation tools use a maximum parsimony approach for finding a "best" mapping of the parasite/symbiont tree onto the host tree. In this formulation, each type of event (duplication, transfer, and loss) has a non-negative cost specified by the user. The objective is to find a reconciliation that minimizes the total cost, that is, the sum over all event types of the number of events of that type multiplied by the cost of that type. Cospeciation is considered a "null event" and therefore has its cost preset to zero.
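As a small illustration of this objective, the sketch below computes the total cost of a reconciliation from its event counts. The function name reconciliation_cost and the counts are made up for the example; the costs are Jane's defaults mentioned on this page (1, 2, and 1 for duplication, transfer, and loss).

```python
def reconciliation_cost(counts, costs):
    """Total cost: number of events of each type times the cost of that type.

    Cospeciation is a null event with cost zero, so it is simply not summed.
    """
    return sum(costs[event] * counts.get(event, 0)
               for event in ("duplication", "transfer", "loss"))


# Hypothetical event counts for one reconciliation.
toy_counts = {"cospeciation": 5, "duplication": 1, "transfer": 2, "loss": 3}
jane_default_costs = {"duplication": 1, "transfer": 2, "loss": 1}

total = reconciliation_cost(toy_counts, jane_default_costs)
print(total)  # 1*1 + 2*2 + 1*3 = 8
```

A maximum parsimony reconciliation is one for which this total is as small as possible among all valid reconciliations.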
Event Costs
Event costs are notoriously difficult to estimate. Many tools have default event costs (e.g., Jane's defaults are 1, 2, and 1 for duplication, transfer, and loss, respectively) and studies are often performed using just the default values. However, different event costs can lead to different solutions and thus different conclusions. For example, if one event has a much lower cost than others, a maximum parsimony reconciliation is likely to favor solutions with more of those kinds of events.
eMPRess's "View cost space" feature uses a technique called Pareto-optimal event counts to show you the impact of different event costs and to allow you to select event costs in a principled and systematic way. Specifically, note that event costs are just relative amounts; there is no intrinsic meaning to a unit of cost. Choosing duplication, transfer, and loss costs of 2, 3, and 1, respectively, is therefore the same as choosing costs of 200, 300, and 100, because the ratios of the costs are the same in both cases. The "View cost space" feature in eMPRess fixes the cost of a loss at 1.0 and then examines the range of costs of duplication and transfer events relative to this cost of 1 for losses. (Recall that cospeciation is a null event and thus has a fixed cost of zero.)
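The point about relative costs can be checked directly. The sketch below rescales a triple of costs so that the loss cost becomes 1, which mirrors the normalization described above; the function name normalize_costs is hypothetical and is not part of eMPRess.

```python
def normalize_costs(duplication, transfer, loss):
    """Rescale the three costs so that the loss cost is exactly 1."""
    if loss <= 0:
        raise ValueError("the loss cost must be positive to normalize by it")
    return (duplication / loss, transfer / loss, 1.0)


small = normalize_costs(2, 3, 1)
large = normalize_costs(200, 300, 100)
print(small, large, small == large)
```

Both calls yield the same normalized triple, so the two cost settings describe the same point in the duplication-transfer cost space.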
The plot that is displayed by "View cost space" divides up the duplication and transfer cost space into color-coded regions. For any combination of costs in the same region, we will get the same set of MPRs. In other words, in a given color-coded region, it suffices to choose just one point - that is one combination of costs. For example, see this figure which is the event cost regions for the gopher-louse dataset.
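To give a feel for how such regions arise, here is a minimal sketch under a simplifying assumption: that each region corresponds to whichever candidate event-count vector is cheapest at that point of the cost space. The candidate vectors are invented for illustration, and the function name cheapest_vectors is hypothetical; eMPRess computes its regions with its own algorithm, not by probing points like this.

```python
import math


def cheapest_vectors(duplication_cost, transfer_cost, vectors, loss_cost=1.0):
    """Return the event-count vectors with the lowest total cost at one point."""
    def cost(vector):
        duplications, transfers, losses = vector
        return (duplication_cost * duplications
                + transfer_cost * transfers
                + loss_cost * losses)

    best = min(cost(v) for v in vectors)
    return [v for v in vectors if math.isclose(cost(v), best)]


# Hypothetical (duplications, transfers, losses) counts, not from a real dataset.
toy_vectors = [(0, 3, 0), (1, 1, 2), (4, 0, 6)]

for d, t in [(1.0, 0.5), (1.0, 0.6), (1.0, 2.0), (0.1, 5.0)]:
    print(d, t, cheapest_vectors(d, t, toy_vectors))
```

The first two points select the same vector and so would fall in the same region; the last two points each select a different vector.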
Dated versus Undated Trees
eMPRess, Jane, and many other tools assume that the trees are undated. That is, while branch lengths may be provided in the newick input files, they are not used in the reconciliation process. Branch lengths - if given - are not assumed to correspond to actual dates when speciation events occurred in the host and parasite/symbiont trees.
Time-Consistency
While a parent node in a tree clearly occurred before its children, the order of the two children is assumed not to be known. In general, the order of nodes that are not ancestrally related to one another is not known. Consider a reconciliation of a parasite/symbiont tree onto a host tree and consider any particular parasite/symbiont species node p. That node p is mapped by the reconciliation to some host node h (or, perhaps, to the edge terminating at h). Clearly, no descendant of p should be mapped by the reconciliation to an ancestor of h. Any reconciliation that satisfies this condition is said to be weakly time-consistent.
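The condition above can be written as a short check. This is a sketch only: it assumes that each tree is given as a child-to-parent dictionary (the root has no entry) and that the reconciliation is given as a dictionary from every parasite node to a host node. These data structures and the function names are assumptions for the example, not eMPRess's internal representation.

```python
def ancestors(node, parent):
    """Return the set of proper ancestors of node, following parent links."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result


def is_weakly_time_consistent(host_parent, parasite_parent, mapping):
    """True if no descendant of p is mapped to an ancestor of the host of p."""
    for q in mapping:
        for p in ancestors(q, parasite_parent):
            if mapping[q] in ancestors(mapping[p], host_parent):
                return False
    return True


# Hypothetical three-node host and parasite trees.
host_parent = {"h1": "h_root", "h2": "h_root"}
parasite_parent = {"p1": "p_root", "p2": "p_root"}

good = {"p_root": "h_root", "p1": "h1", "p2": "h2"}
bad = {"p_root": "h1", "p1": "h_root", "p2": "h1"}  # p1 lands above its parent's host

print(is_weakly_time_consistent(host_parent, parasite_parent, good))
print(is_weakly_time_consistent(host_parent, parasite_parent, bad))
```

In the second mapping, p1 is a child of p_root but is mapped to h_root, an ancestor of the host h1 on which p_root sits, so the check fails.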
Why "weak"? There is another constraint that we also wish to satisfy and this one has to do with transfer (aka host switch) events and the fact that the trees are undated. When a transfer event occurs involving a parasite node p, one of its children, say p', is transferred to a branch in the host tree that is not ancestrally related (that is, not an ancestor nor a descendant) to the host on which p is mapped. We say that p "takes off" from the host branch on which it resides and that p' "lands" on a branch somewhere else on the tree. The place where p' lands is called the "landing site."
Because the tree is not dated, we don't know if the landing site is contemporaneous with the take-off site. In theory, the take-off and landing sites should be contemporaneous, but there's no way to know for sure. We say that a reconciliation is strongly time-consistent if it is not only weakly time-consistent but there also exists some ordering of the internal nodes of the host tree that guarantees that, for every transfer event, the take-off and landing sites are contemporaneous.
Ideally, we would like strongly time-consistent reconciliations. Here is some good news and bad news: Even finding weakly time-consistent maximum parsimony reconciliations is computationally intractable (NP-hard). Jane uses a heuristic that only considers strongly time-consistent reconciliations, but doesn't guarantee that they are truly maximum parsimony reconciliations (i.e., their total event costs may be higher than optimal). eMPRess, and most other tools, use much faster exact algorithms that do guarantee maximum parsimony but with the possibility that the resulting reconciliations are not time-consistent. eMPRess, however, checks each solution that it finds and indicates whether it is strongly time-consistent (the best outcome), weakly time-consistent, or not even weakly time-consistent.
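One way to think about the "there exists some ordering" requirement is as a search for a topological order. The sketch below is a simplification and not eMPRess's actual check: it assumes a host tree given as a child-to-parent dictionary and a list of transfers given as (take-off, landing) pairs, each naming the host node at the lower end of the branch involved. Two branches can overlap in time only if each one begins before the other ends, which gives two "must come earlier" constraints per transfer. Any further constraints that a full check would need, such as those implied by the parasite tree, are deliberately ignored. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def find_host_ordering(host_parent, transfers):
    """Return one ordering of host nodes that makes every transfer's take-off
    and landing branches overlap in time, or None if no such ordering exists."""
    predecessors = {}  # node -> set of nodes that must occur earlier
    for child, parent in host_parent.items():
        predecessors.setdefault(child, set()).add(parent)
        predecessors.setdefault(parent, set())
    for takeoff, landing in transfers:
        # Each branch must start (at its parent node) before the other ends.
        for start_of, end_of in ((takeoff, landing), (landing, takeoff)):
            if start_of in host_parent:
                predecessors[end_of].add(host_parent[start_of])
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


# Hypothetical host tree: root -> a, b; a -> a1, a2; b -> b1, b2.
host_parent = {"a": "root", "b": "root",
               "a1": "a", "a2": "a", "b1": "b", "b2": "b"}

one_transfer = [("a1", "b")]                # requires a to occur before b
conflicting = [("a1", "b"), ("b1", "a")]    # also requires b to occur before a

print(find_host_ordering(host_parent, one_transfer))
print(find_host_ordering(host_parent, conflicting))
```

With both transfers present the constraints form a cycle, so no ordering exists and the function returns None.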
Dealing with many MPRs
The number of MPRs for a given dataset and a fixed set of event costs can be huge. In some datasets that we have explored, there have been more than 10^50 (1 with 50 zeros after it) MPRs. Nguyen et al. have proposed computing a median MPR in such cases. The median is an MPR that is, roughly speaking, in the "middle" of the space of MPRs and is thus a plausibly good representative. More precisely, the distance between two MPRs is the number of events in which they disagree and a median MPR is one that minimizes the total distance to all other MPRs.
In general, there's not just one median. For example, consider the numbers 1, 2, 3, 4. Both 2 and 3 are medians. In higher-dimensional spaces (such as the space of all MPRs), there can be many medians - in fact a huge number of medians. But, a median is still presumably more representative than a completely random MPR. Thus, in "View reconciliations", if "One MPR" is selected, eMPRess chooses a random median. Since there are many medians, in general, you won't necessarily see the same MPR each time you do this!
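The definition of a median used here (an element that minimizes the total distance to all others) can be tried out on small examples. The sketch below is a brute-force illustration, not the algorithm eMPRess uses, and it would not scale to realistic numbers of MPRs. It treats an MPR as a set of events and takes "the number of events in which they disagree" to mean the size of the symmetric difference; that reading, the function names, and the toy events are assumptions.

```python
def medians(items, distance):
    """Return every item that minimizes the total distance to all items."""
    totals = [sum(distance(a, b) for b in items) for a in items]
    best = min(totals)
    return [a for a, total in zip(items, totals) if total == best]


def mpr_distance(mpr_a, mpr_b):
    """Number of events found in exactly one of the two reconciliations."""
    return len(mpr_a ^ mpr_b)


# The numeric example from the text: both 2 and 3 are medians.
number_medians = medians([1, 2, 3, 4], lambda a, b: abs(a - b))
print(number_medians)

# Hypothetical MPRs over made-up event labels.
m1 = frozenset({"e1", "e2"})
m2 = frozenset({"e1", "e3"})
m3 = frozenset({"e1", "e2", "e4"})
mpr_medians = medians([m1, m2, m3], mpr_distance)
print(mpr_medians)
```

For the numbers, the total distances are 6, 4, 4, and 6, so 2 and 3 tie. For the toy MPRs, m1 has the smallest total distance and is the only median.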
There's another useful feature in the event that the number of MPRs is large. That option is to cluster the space of MPRs into groups based on similarity. In the "View solution space" pull-down menu, choose "Clusters". A window pops up to allow you to enter the number of clusters that you desire to construct (which can be any number between 1 - which means no clustering - and the total number of MPRs). In our experience, 2 or 3 clusters is generally sufficient. Then, eMPRess uses a clustering algorithm that clusters MPRs according to their distance from one another, using the distance measure described above. eMPRess displays a histogram of the distances between all pairs of MPRs in the first row, the distances between all MPRs within each of the two clusters in the second row, and so forth up to the maximum number of clusters that you've specified.
Finally, in "View reconciliations", you can choose "One per cluster", which will display one randomly selected median reconciliation in each cluster.
This set of features provides a systematic way to find best representative sets of MPRs when the space of MPRs is too large to be adequately represented by a single MPR.
Download and Install eMPRess
Software license information
eMPRess Software
Copyright (C) 2020 Libeskind-Hadas Research Group, Harvey Mudd College
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details: https://www.gnu.org/licenses.
Register (Optional)
If you would like to be notified of updates or announcements regarding eMPRess, please complete this form.
One-click installation for the eMPRess GUI
If you plan to exclusively use the graphical user interface version of eMPRess, you may be able to perform a quick-and-easy one-click install. If this installer doesn't work on your platform, please use the "Install eMPRess from GitHub" instructions below.
MacOS
The current one-click installer for eMPRess is no longer maintained; please use the "Install eMPRess from GitHub" instructions below.
Linux
Go to the latest release on the eMPRess GitHub Releases page and download the zip file named linux-empress-app.
Extract the whole directory to a location of your choice. You can do this by right-clicking on the folder and pressing Extract Here. It is important that you extract the whole directory and not just the executable file.
Enter the directory and find a file named empress_gui. Right-click on that file, and click Properties, then select the Permissions tab. Then, check the box that says Allow executing file as a program.
Double click on empress_gui to run eMPRess.
Depending on your system, you might not be able to double click to run the executable. In that case, open a terminal in that directory and run ./empress_gui.
Windows
Go to the latest release on the eMPRess GitHub Releases page and download the zip file named windows-empress-app.
Extract the whole directory to a location of your choice. You can do this by right-clicking the folder and pressing Extract All. It is important that you extract the whole directory and not just the executable file.
Enter the directory and find a file named empress_gui. Double click on that file.
Windows might prevent you from running the application, saying Windows protected your PC. Click on More Info and then click Run Anyway. Windows might take some time to scan the application for viruses. After it finishes scanning for viruses, the application will automatically open.
Install eMPRess from GitHub
Please refer to the "Install Empress for Development" wiki page on GitHub.
Please see the documentation for details on running both the GUI and the CLI.
Sample data
This zip file contains four sample datasets, each comprising a host tree, a parasite/symbiont tree, and a mapping between the tips of the two trees.
Fig-wasp dataset from Weiblen GD and Bush GW, Speciation in fig pollinators and parasites. Molecular Ecology 2002, 11, 1573-1578.
Gopher-louse dataset from Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259.
Seabird-louse dataset from Paterson AM, Wallis GP, Wallis LJ, Gray RD, Seabird louse coevolution: complex histories revealed by 12S rRNA sequences and reconciliation analyses. Systematic Biology 2000, 49, 383-399.
Finches and brood parasites dataset from Sorenson MD, Balakrishnan CN, Payne RB, Clade-limited colonization in brood parasitic finches (Vidua spp.). Systematic Biology 2004, 53, 140-153.
Documentation
Input Files
Three files are required as input: the host tree, the parasite (symbiont) tree, and a tip mapping.
The host and parasite trees must be in newick format in which all leaves and internal nodes are named and have unique names (no names repeated anywhere). The files must have the extension .nwk. These trees can have branch length information, but eMPRess ignores it. Species names must not contain whitespace. The newick convention used here is that whitespace is replaced with an underscore symbol. For example, Diomedea epomophora should instead be written Diomedea_epomophora.
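If your species names contain spaces, a small helper can apply the underscore convention before you write the names into a newick file. This is a minimal sketch; the function name newick_safe_name is hypothetical and is not part of eMPRess.

```python
import re


def newick_safe_name(name):
    """Replace each run of whitespace inside a species name with one underscore."""
    return re.sub(r"\s+", "_", name.strip())


safe = newick_safe_name("Diomedea epomophora")
print(safe)
```

Apply the same conversion to the names in the mapping file so that the tip names in all three files still match.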
The mapping is a text file that ends with the extension .mapping and specifies the association of the tips of the parasite tree to the tips of the host tree. Each line in the file is of the form:
parasiteTipName : hostTipName
Note that this mapping must associate each parasite tip with at most one host tip. It is fine for a parasite tip not to be mapped to any host tip, but a parasite tip cannot be mapped to more than one host tip. Similarly, it is fine for a host tip not to be mapped from any parasite tip. Finally, it's fine for multiple parasite tips to be mapped to the same host tip.
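The rules above can be checked before loading the files into eMPRess. The sketch below parses the text of a mapping file and rejects a parasite tip that appears more than once. It assumes that tip names do not themselves contain a colon and that blank lines can be skipped; the function name parse_mapping and the tip names are made up for the example, and this is not the parser eMPRess uses.

```python
def parse_mapping(text):
    """Parse lines of the form 'parasiteTipName : hostTipName' into a dict."""
    mapping = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are harmless
        parasite, separator, host = line.partition(":")
        parasite, host = parasite.strip(), host.strip()
        if not separator or not parasite or not host:
            raise ValueError(
                f"line {line_number}: expected 'parasiteTipName : hostTipName'")
        if parasite in mapping:
            raise ValueError(
                f"line {line_number}: parasite tip {parasite!r} is mapped more than once")
        mapping[parasite] = host
    return mapping


# Hypothetical tip names; two parasite tips may share one host tip.
example = """\
louse_A : gopher_1
louse_B : gopher_1
louse_C : gopher_2
"""
tips = parse_mapping(example)
print(tips)
```

Host tips that never appear on the right-hand side, and parasite tips that never appear on the left-hand side, are simply absent from the dictionary, which matches the rule that unmapped tips are allowed.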
Running eMPRess through the Graphical User Interface
Documentation on running eMPRess through the GUI is available here.
Running eMPRess through the Command Line Interface
Documentation on running eMPRess through the CLI is available here.
Credits and citing eMPRess
Many people contributed to the development of eMPRess, both in the development of the algorithms and the implementation of the software tool.
If you use eMPRess in your work, please cite
"eMPRess: A Systematic Cophylogeny Reconciliation Tool" by S. Santichaivekin, Q. Yang, J. Liu, R. Mawhorter, J. Jiang, T. Wesley, Y-C. Wu, and R. Libeskind-Hadas, in preparation.
The algorithms employed in eMPRess were published in these papers:
"Pareto-Optimal Phylogenetic Tree Reconciliation" by R. Libeskind-Hadas, Y-C Wu, M. Bansal, and M. Kellis, Bioinformatics, Volume 30, Issue 12, 15 June 2014, Pages i87–i95, https://doi.org/10.1093/bioinformatics/btu289
"An Efficient Exact Algorithm for Computing All Pairwise Distances Between Reconciliations in the Duplication-Transfer-Loss Model" by S. Santichaivekin, R. Mawhorter, and R. Libeskind-Hadas, BMC Bioinformatics, 2019 Dec 17;20(Suppl 20):636. doi: 10.1186/s12859-019-3203-9
"Hierarchical Clustering of Maximum Parsimony Reconciliations" by R. Mawhorter and R. Libeskind-Hadas, BMC Bioinformatics, 26 Nov 2019, 20(1):612 DOI: 10.1186/s12859-019-3223-5
The eMPRess code base was developed by S. Santichaivekin, R. Mawhorter, J. Liu, Q. Yang, J. Jiang, T. Wesley, Y-C Wu, and R. Libeskind-Hadas with additional contributions by C. Ngo, P. Andrews, S. Sehra, Adrian Garcia, Alberto Garcia, D. Makhervaks, and Z. Witzel.
FAQ
Why doesn't my newick file load?
Make sure that the trees don't have polytomies.
My files load, but eMPRess can't find reconciliations. Why?
Make sure that all of the nodes have different names.
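Both of the problems above can be spotted with a quick check of the newick string before loading it. The sketch below is a rough, assumption-laden check rather than a full newick parser: it assumes plain unquoted names, strips branch lengths, counts the children of each parenthesized group, and looks for repeated names. The function name check_newick is hypothetical.

```python
import re
from collections import Counter


def check_newick(newick):
    """Return a list of problems found: non-binary nodes and repeated names."""
    problems = []
    # Drop branch lengths such as ":0.12" so only names and structure remain.
    stripped = re.sub(r":[^,();]*", "", newick)

    children = []  # one child counter per currently open "("
    for character in stripped:
        if character == "(":
            children.append(1)
        elif character == "," and children:
            children[-1] += 1
        elif character == ")" and children:
            count = children.pop()
            if count != 2:
                problems.append(f"internal node with {count} children (not binary)")

    names = re.findall(r"[^,();\s]+", stripped)
    for name, occurrences in Counter(names).items():
        if occurrences > 1:
            problems.append(f"name {name!r} is used {occurrences} times")
    return problems


print(check_newick("((a:0.1,b:0.2)x,(c,d)y)r;"))  # a well-formed binary tree
print(check_newick("(a,b,c)r;"))                   # a polytomy
print(check_newick("((a,b)x,(a,d)y)r;"))           # a repeated tip name
```

An empty list means that neither problem was detected; it does not guarantee that the file is otherwise valid.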
Feedback, known issues, reporting bugs, etc.
The current version of eMPRess (version 1.1) has some known limitations or bugs listed below. If you find others, or would like to give us feedback or suggestions, please complete this feedback form.
Here are some issues that we're aware of in the current version of eMPRess:
If you install eMPRess from source code (by using git clone), install Python from Homebrew, and are using dark mode on macOS, the text in the eMPRess GUI will not show up. You can fix this by switching to light mode. The eMPRess one-click executable does not have this problem.
If you install eMPRess from source code (by using git clone) and install Python from Anaconda or Python.org, you will not be able to save figures in formats other than png. You can fix this by downloading the one-click executable of eMPRess or by using the command line interface to save the figure.
The development of eMPRess was supported by grant 1905885 from the National Science Foundation to Harvey Mudd College.