
Additional Wikidata tab on leaves's description #863

Open
oolonek opened this issue Jul 9, 2024 · 19 comments

@oolonek

oolonek commented Jul 9, 2024

It would be a nice addition to be able to access the Wikidata page of a given taxon when clicking on its leaf.

For example for https://www.onezoom.org/life/@Aloe_ferox=608115 one could reach https://www.wikidata.org/wiki/Q1194889

Using Qlever, all pairs of Open Tree of Life IDs and Wikidata QIDs can be retrieved in milliseconds (https://qlever.cs.uni-freiburg.de/wikidata/MjoDT0?exec=true), currently yielding 2,034,851 pairs.

@hyanwong
Member

hyanwong commented Jul 9, 2024

We have the wikidata ID anyway, in the ordered_leaves table, so we don't need to use the Qlever site (although I'm intrigued how that site works).

@davidebbo
Contributor

davidebbo commented Jul 9, 2024

As a workaround, note that from the Wikipedia page, you can choose Tools / Wikidata item to go to that. So it's indirect, but there is a path to it...

@hyanwong
Member

hyanwong commented Jul 9, 2024

Aha, of course, the OTT IDs are now on wikidata (they didn't use to be; I argued for their introduction), so we can find the mapping using a SPARQL query. Neat.

@davidebbo
Contributor

> the OTT IDs are now on wikidata

Oh, I didn't know that! It's P9157.
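
For reference, here is a minimal Python sketch of that lookup, assuming a standard SPARQL endpoint that returns SPARQL 1.1 JSON results. `build_ott_query` and `parse_bindings` are hypothetical helper names, and the sample payload is hand-written: it reuses the Aloe ferox pair from the first comment, assuming the number in the OneZoom URL is the OTT ID. This is not necessarily the query behind the Qlever link.

```python
# P9157 is the Wikidata property for the Open Tree of Life ID (named in this thread).
OTT_PROPERTY = "P9157"


def build_ott_query(limit=None):
    """Return a SPARQL query selecting every (item, OTT ID) pair."""
    query = (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item ?ott WHERE {\n"
        f"  ?item wdt:{OTT_PROPERTY} ?ott .\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def parse_bindings(sparql_json):
    """Turn a SPARQL 1.1 JSON result into {ott_id: [qid, ...]}."""
    mapping = {}
    for row in sparql_json["results"]["bindings"]:
        # Item URIs look like http://www.wikidata.org/entity/Q1194889
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        ott = row["ott"]["value"]
        mapping.setdefault(ott, []).append(qid)
    return mapping


# Hand-written sample response, not real endpoint output.
sample = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1194889"},
                "ott": {"type": "literal", "value": "608115"},
            }
        ]
    }
}

print(build_ott_query(limit=10))
print(parse_bindings(sample))
```

The parser keeps a list of QIDs per OTT ID because nothing guarantees the mapping is one-to-one.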

@hyanwong
Member

hyanwong commented Jul 9, 2024

Yes, I noticed it the other day. It's new, I think (created 2021)

@hyanwong
Member

hyanwong commented Jul 9, 2024

It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.

We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

@davidebbo
Contributor

> It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.
>
> We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

Yes, that was my first thought when I saw that. It has the potential to simplify things a lot. For now, it would be easy to add some instrumentation that checks whether the QID we find via other paths maps back to the same ott.
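
A rough sketch of what that instrumentation could look like, assuming both mappings are already loaded as plain dicts. The function name, the dict shapes, and the toy IDs are all hypothetical, not the actual OneZoom code.

```python
from collections import Counter


def check_round_trip(our_ott_to_qid, wikidata_qid_to_ott):
    """Classify each OTT we map to a QID by what Wikidata says about that QID.

    our_ott_to_qid:      {ott_id: qid} found via the existing (non-OTT) paths
    wikidata_qid_to_ott: {qid: ott_id} read from Wikidata's OTT field
    """
    counts = Counter()
    for ott, qid in our_ott_to_qid.items():
        wd_ott = wikidata_qid_to_ott.get(qid)
        if wd_ott is None:
            counts["no_ott_in_wikidata"] += 1
        elif wd_ott == ott:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Toy data with made-up IDs, for illustration only.
ours = {1: "Q10", 2: "Q20", 3: "Q30"}
wikidata = {"Q10": 1, "Q20": 99}
print(check_round_trip(ours, wikidata))
```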

Anyway, we're digressing a bit from @oolonek's request 😄

@oolonek
Author

oolonek commented Jul 10, 2024

> It could be that this is a better way to get the mappings now, rather than going via the ncbi IDs etc.
>
> We could probably check how accurate and comprehensive our mapping is, versus the one on wikidata. If we can simply move to using wikidata, it would probably simplify the code considerably. However, my suspicion is that there are lots of OTT taxa that have NCBI / GBIF ids but which aren't currently on wikidata.

It would be interesting to find out which ones are missing. Do you expect taxa to lack a WD entry altogether, or to be present on WD but simply lack an OTT ID on their WD page? In both cases it would be useful to find out and eventually push the missing info to WD. I will look into this on my side as well. Thanks for your quick feedback :)

@davidebbo
Contributor

It's probably going to be a combination of things.

  • Looking at the DB, out of 2,235,475 leaf taxa, 403,072 don't have a WD entry that we were able to locate (18%)
  • When we have WD entries, some may be missing an OTT
  • Some may have an OTT that is different from the OTT we have for the taxa

But for the last two, we really don't know right now because we've never looked at the WD OTT field. But it would be interesting to get that data.

@hyanwong
Member

Good summary. Thanks @davidebbo . And yes, it would be interesting to see how this compares to what wikidata think is the correct mapping.

@mdrishti

mdrishti commented Jul 11, 2024

Hi,

I have also been working on collecting, for each Wikidata id, the taxonomic ids from ott and from 11 other databases (gbif, ncbi, eol, itis etc). I found that ~2,032,649 wd ids have an ott id and 1,435,238 wd ids don't. The latter map to other databases.
On the other hand, out of a total of 4,528,302 ott ids, 2,530,549 don't have wd ids.

There are 3,826,740 ott ids at either the species or strain level. I was wondering about the criteria used for keeping the ott id in OneZoom. Also, do all 2,235,475 leaf taxa in OneZoom have an ott id?

Too many numbers above! Sorry!

@hyanwong
Member

hyanwong commented Jul 11, 2024

> I was wondering about the criteria used for keeping the ott id in OneZoom

We tend to retain all the OTTs that are present in the synthetic OpenTree (give or take some that differ because of using bespoke trees in particular areas of the tree, mostly mammals / birds)

@davidebbo
Contributor

3,826,740 - 2,235,475 = 1,591,265. That's a huge number of species otts that are not in the OneZoom tree. But I do see the same thing if I filter taxonomy.tsv for only species.

I guess that means that all these are incertae sedis, and hence not in the synthetic tree?
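
One way to test that guess is sketched below. It assumes the OpenTree taxonomy.tsv layout as I understand it: fields separated by tab, pipe, tab; a header row with `rank` and `flags` columns; and incertae sedis taxa carrying a flag containing `incertae_sedis`. Those layout details, the helper names, and the sample rows are assumptions for illustration, so check them against the actual file before relying on the counts.

```python
import io
from collections import Counter

SEP = "\t|\t"  # assumed field separator of the OpenTree taxonomy.tsv


def read_taxa(handle):
    """Yield one dict per taxonomy row, keyed by the header names."""
    header = handle.readline().rstrip("\r\n").split(SEP)
    for line in handle:
        fields = line.rstrip("\r\n").split(SEP)
        yield dict(zip(header, fields))


def summarise_species(handle):
    """Count species rows, and how many of them carry an incertae sedis flag."""
    counts = Counter()
    for taxon in read_taxa(handle):
        if taxon.get("rank") != "species":
            continue
        counts["species"] += 1
        if "incertae_sedis" in taxon.get("flags", ""):
            counts["species_incertae_sedis"] += 1
    return counts


def make_row(*fields):
    """Build one sample line in the assumed layout (trailing separator included)."""
    return SEP.join(fields) + SEP + "\n"


# Made-up rows, for illustration only.
sample = io.StringIO(
    make_row("uid", "parent_uid", "name", "rank", "sourceinfo", "uniqname", "flags")
    + make_row("1", "", "Examplea", "genus", "ncbi:10", "", "")
    + make_row("2", "1", "Examplea alba", "species", "ncbi:11", "", "")
    + make_row("3", "1", "Examplea nigra", "species", "gbif:12", "", "incertae_sedis")
    + make_row("4", "1", "Examplea rubra", "species", "gbif:13", "", "incertae_sedis_inherited,unplaced")
)
print(summarise_species(sample))
```

If the incertae sedis count comes out well below 1,591,265 on the real file, other flags or filters would have to explain the rest.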

@davidebbo
Contributor

I did some instrumentation. Out of 1,817,682 OneZoom otts that we are mapping to a Wikidata item:

  • 1,607,691 (~88%) have an ott in Wikidata, and it matches our ott
  • 2,893 (<1%) have an ott in Wikidata that does not match our ott
  • 207,098 (~11%) don't have an ott in Wikidata
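
As a quick sanity check on those figures, the snippet below recomputes the shares and confirms that the three buckets add up to the stated total. The bucket labels and the `shares` helper are just illustrative names.

```python
def shares(counts):
    """Return each bucket's fraction of the overall total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts as reported in this comment.
reported = {
    "ott in Wikidata, matches ours": 1_607_691,
    "ott in Wikidata, differs from ours": 2_893,
    "no ott in Wikidata": 207_098,
}

# The three buckets should account for every mapped OneZoom ott.
assert sum(reported.values()) == 1_817_682

for label, share in shares(reported).items():
    print(f"{label}: {share:.2%}")
```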

@hyanwong
Member

Nice. Thanks @davidebbo. It's good there aren't many wrong matches. Seems like we could switch at some point to using wikidata to provide all our mapping then. What we would be missing is data to do with other identifiers, like NCBI, which we get automatically from the opentree.

However, I think it would be fine to omit all the ncbi -> wikidata mapping, and just go straight to mapping OTT from the wikidata JSON dump to the WD qID.
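
A minimal sketch of that direct route, assuming the usual layout of the Wikidata JSON dump: one entity per line inside a JSON array, each line ending in a comma, with external IDs stored as string values under `claims`. The function name and the sample entities are hypothetical, and the OTT value format (bare digits here) should be checked against real data.

```python
import json


def ott_pairs_from_dump(lines):
    """Yield (ott_id, qid) for every P9157 value found in dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets that wrap the dump
        entity = json.loads(line)
        for claim in entity.get("claims", {}).get("P9157", []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # "novalue" / "somevalue" snaks carry no ID
            yield snak["datavalue"]["value"], entity["id"]


def fake_entity(qid, ott=None):
    """Build a stripped-down entity line for illustration (not real dump output)."""
    claims = {}
    if ott is not None:
        claims["P9157"] = [
            {"mainsnak": {"snaktype": "value", "datavalue": {"value": ott, "type": "string"}}}
        ]
    return json.dumps({"id": qid, "claims": claims}) + ","


sample_dump = ["[", fake_entity("Q1194889", "608115"), fake_entity("Q42"), "]"]
print(dict(ott_pairs_from_dump(sample_dump)))
```

Reading the dump line by line avoids loading the whole file into memory, which matters given its size.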

@oolonek
Author

oolonek commented Jul 12, 2024

> I did some instrumentation. Out of 1,817,682 OneZoom otts that we are mapping to a Wikidata item:
>
>   • 1,607,691 (~88%) have an ott in Wikidata, and it matches our ott
>   • 2,893 (<1%) have an ott in Wikidata that does not match our ott
>   • 207,098 (~11%) don't have an ott in Wikidata

Hi @davidebbo, are these files somewhere in the OneZoom repo, or were they generated elsewhere? Would you mind sharing them? Also, I guess it is the case, but just to be sure: could you confirm it's OTT 3.6 you are using?

@oolonek
Author

oolonek commented Jul 12, 2024

> Nice. Thanks @davidebbo. It's good there aren't many wrong matches. Seems like we could switch at some point to using wikidata to provide all our mapping then. What we would be missing is data to do with other identifiers, like NCBI, which we get automatically from the opentree.
>
> However, I think it would be fine to omit all the ncbi -> wikidata mapping, and just go straight to mapping OTT from the wikidata JSON dump to the WD qID.

Why not also rely on WD to retrieve the NCBI ids?
WD could be the single source for all taxon ids; that way there would be a single place to work on to improve mappings.

See https://qlever.cs.uni-freiburg.de/wikidata/QObdaz?exec=true

@davidebbo
Contributor

> Why not also rely on WD to retrieve the NCBI ids?
> WD could be the single source for all taxon ids; that way there would be a single place to work on to improve mappings.

Yes, that would be a good end state if the data quality is sufficient. In such a world, we may not need to use the OpenTree taxonomy file at all. We could also do away with all the EOL logic.

Basically, we'd have:

  • A newick tree that includes otts
  • We'd use WD to map each ott to a WD item, and to all the other sources
  • We'd also use WD for all media
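
To make the first two steps concrete, here is a toy sketch. It assumes node labels end in `_ott<number>`, which is a common OpenTree-style convention but is an assumption here, and uses a plain regex rather than a real newick parser. The function names and all IDs except the Aloe ferox pair from the first comment are made up.

```python
import re

# Assumed label convention: Genus_species_ott12345
OTT_LABEL = re.compile(r"_ott(\d+)")


def otts_in_newick(newick):
    """Return every OTT ID that appears in the tree's labels, in order."""
    return [int(match) for match in OTT_LABEL.findall(newick)]


def annotate(newick, ott_to_qid):
    """Map each OTT in the tree to its Wikidata QID (None when WD has no entry)."""
    return {ott: ott_to_qid.get(ott) for ott in otts_in_newick(newick)}


# Illustrative tree; only the Aloe_ferox pair comes from this thread.
tree = "(Aloe_ferox_ott608115,Made_up_taxon_ott999)Made_up_parent_ott1;"
print(annotate(tree, {608115: "Q1194889"}))
```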

If we went in that direction, we should probably do a rewrite of the tree building logic, rather than iteratively move it in that direction.

I don't think we're quite ready for that yet, but it is a direction.

@hyanwong
Member

hyanwong commented Jul 12, 2024

> Why not also rely on WD to retrieve the NCBI ids?

I'm not sure that's so sensible, because the OTT IDs are based on NCBI, GBIF, etc. So the OTT taxonomy.tsv file is the canonical source of the NCBI ids that go into generating an OTT.

I.e. the mappings in the taxonomy.tsv file are the definition of an OTT, for a given OpenTree release.
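
To illustrate, a small sketch of reading those source ids back out of a taxonomy.tsv row. It assumes the `sourceinfo` column is a comma-separated list of `prefix:id` entries such as `ncbi:123,gbif:456`; that format, the helper name, and the sample values are assumptions for illustration.

```python
def parse_sourceinfo(sourceinfo):
    """Split a sourceinfo string into {source_prefix: [id, ...]}."""
    ids = {}
    for entry in sourceinfo.split(","):
        prefix, sep, value = entry.strip().partition(":")
        if sep and value:  # ignore empty or malformed entries
            ids.setdefault(prefix, []).append(value)
    return ids


# Made-up values, for illustration only.
print(parse_sourceinfo("ncbi:123,gbif:456"))
```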
