
Wikidata search results differ significantly between English and British English
Open, Needs Triage | Public | BUG REPORT

Assigned To
None
Authored By
Nikki
Apr 12 2023, 10:26 AM
Referenced Files
F57215316: figure-1.png
Aug 9 2024, 7:55 PM
F57068678: en-ca.png
Aug 5 2024, 12:24 PM
F57068674: en.png
Aug 5 2024, 12:24 PM
F57061314: en-gb-after-undo1.png
Aug 5 2024, 11:27 AM
F57061354: en-gb-before-undo1.png
Aug 5 2024, 11:27 AM
F36948067: wh-en.png
Apr 12 2023, 10:26 AM
F36948066: wh-en-gb.png
Apr 12 2023, 10:26 AM
F36948071: wh3-en.png
Apr 12 2023, 10:26 AM

Description

When searching Wikidata, the search results when the interface language is set to English versus when it is set to British English differ far more than expected.

Example: "wh" using English vs "wh" using British English.

Top ten results on Special:Search, for en and en-gb:
wh2-en.png (892×606 px, 97 KB) wh2-en-gb.png (921×337 px, 102 KB)

There are no results in common.
The top result for en is the 28th result for en-gb.
The second result for en is the 128th result for en-gb, after many other results which have no obvious connection to the search term.
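
The comparison above can be made repeatable with a small helper. This is only a minimal sketch: `compare_rankings` is a hypothetical name, and the entity IDs in the usage lines are placeholders, not the actual results from the screenshots. It reports which IDs the two top-N lists share and where each top result from the first list appears in the second.

```python
from typing import Optional


def compare_rankings(a: list[str], b: list[str], top_n: int = 10) -> dict:
    """Compare two ranked lists of entity IDs (best result first).

    Returns the IDs shared by both top-N lists, and the 1-based position
    in `b` of each top-N result from `a` (None if absent from `b`).
    """
    top_a, top_b = a[:top_n], set(b[:top_n])
    position_in_b: dict[str, int] = {}
    for index, entity_id in enumerate(b, start=1):
        position_in_b.setdefault(entity_id, index)  # keep first occurrence
    common = [entity_id for entity_id in top_a if entity_id in top_b]
    ranks: dict[str, Optional[int]] = {
        entity_id: position_in_b.get(entity_id) for entity_id in top_a
    }
    return {"common_top": common, "rank_in_b": ranks}


# Placeholder IDs for illustration only.
en_results = ["Q1", "Q2", "Q3"]
en_gb_results = ["Q9", "Q8", "Q1"]
report = compare_rankings(en_results, en_gb_results, top_n=2)
```

Feeding it the full en and en-gb result lists would give the "results in common" and "Nth result for en-gb" figures quoted above without counting by hand.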

Top seven results when doing an entity search, for en and en-gb:
wh3-en.png (396×415 px, 36 KB) wh3-en-gb.png (398×416 px, 34 KB)

There is only one result in common.
5 of the 7 results for en show a "wh" label or alias.
Only 1 of the 7 results for en-gb shows a "wh" label or alias.

Q12874593 (watt hour) as displayed on Special:Search, in en and en-gb:
wh-en.png (90×324 px, 8 KB) wh-en-gb.png (87×316 px, 7 KB)

en shows the label with the matched alias after the link.
en-gb replaces the label with the matched alias.

Event Timeline

A few days ago someone removed the en-gb label (which was identical to the en label) from P973 (described at URL).

When I searched for "desc", it no longer appeared in the first set of results:

en-gb-before-undo1.png (479×768 px, 54 KB)

After I undid the edit, it reappeared:

en-gb-after-undo1.png (481×768 px, 53 KB)

(if it matters: these screenshots were taken while adding a new statement on https://www.wikidata.org/wiki/Q13273696)
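
To reproduce this kind of comparison outside the UI, one option is to query the `wbsearchentities` API module once per interface language. The sketch below only builds the request URLs. It assumes that setting both `language` and `uselang` approximates what the property selector does, which this task does not confirm; `property_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def property_search_url(term: str, lang: str, limit: int = 7) -> str:
    """Build a wbsearchentities URL for a property search in one language."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": lang,   # language to search in
        "uselang": lang,    # assumption: mirror the interface language
        "type": "property",
        "limit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


# One URL per interface language compared in this comment.
urls = {lang: property_search_url("desc", lang) for lang in ("en", "en-gb", "en-ca")}
```

Fetching each URL before and after a label edit would show whether the ordering change is reproducible, independent of the item being edited.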

Here are the results for "desc" on the same item with the interface language set to en:

en.png (480×1 px, 65 KB)

and en-ca:
en-ca.png (482×1 px, 66 KB)

There seems to be a bigger issue with ranking, because those results are worse than the en-gb ones.

en-gb:

| Property | Total statements | Total statements (same P31) |
| --- | --- | --- |
| described by source (P1343) | 2,271,643 | 35 |
| described at URL (P973) | 2,249,411 | 29 |
| relative (P1038) | 80,738 | 0 |
| DeCS ID (P9272) | 25670 (column split unclear) | |
| main subject (P921) | 30,431,768 | 78 |
| taxon author (P405) | 0 | 0 |
| year of publication of scientific name for taxon (P574) | 4 | 0 |

en-ca:

| Property | Total statements | Total statements (same P31) |
| --- | --- | --- |
| DeCS ID (P9272) | 25670 (column split unclear) | |
| main subject (P921) | 30,431,768 | 78 |
| taxon author (P405) | 0 | 0 |
| year of publication of scientific name for taxon (P574) | 4 | 0 |
| described by source (P1343) | 2,271,643 | 35 |
| child (P40) | 1,748,489 | 0 |
| described at URL (P973) | 2,249,411 | 29 |

en:

| Property | Total statements | Total statements (same P31) |
| --- | --- | --- |
| main subject (P921) | 30,431,768 | 78 |
| taxon author (P405) | 0 | 0 |
| year of publication of scientific name for taxon (P574) | 4 | 0 |
| described by source (P1343) | 2,271,643 | 35 |
| child (P40) | 1,748,489 | 0 |
| described at URL (P973) | 2,249,411 | 29 |
| describes a project that uses (P4510) | 588,228 | 0 |

I would say the two best matches for "desc" are clearly "described by source" and "described at URL": they have a lot of statements in general, have been used on items with the same P31, and have labels starting with the string being searched for. They are closely followed by "main subject", where the search matches the beginning of an alias rather than the label.

It makes no sense at all for "taxon author" and "year of publication of scientific name for taxon" to be so highly ranked. They're not used as main statements, this isn't a taxon item, and the search string doesn't even appear anywhere in the current label, let alone at the beginning.
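
The ordering argued for here can be written down as a sort key. This is a rough sketch of the preference described in this comment, not of how the search backend actually ranks. The figures come from the en results in this thread. The `alias_prefix` flag is set only for "main subject", the one alias match the comment mentions, and is assumed False elsewhere. The order of criteria in the key is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    pid: str
    total: int                  # total statements using the property
    same_p31: int               # statements on items with the same P31
    alias_prefix: bool = False  # an alias starts with the search term


def rank(term: str, candidates: list[Candidate]) -> list[Candidate]:
    """Sort best-first: same-P31 usage, label prefix, alias prefix, total."""
    needle = term.casefold()

    def key(c: Candidate) -> tuple:
        label_prefix = c.label.casefold().startswith(needle)
        return (c.same_p31 > 0, label_prefix, c.alias_prefix, c.total)

    return sorted(candidates, key=key, reverse=True)


CANDIDATES = [
    Candidate("main subject", "P921", 30_431_768, 78, alias_prefix=True),
    Candidate("taxon author", "P405", 0, 0),
    Candidate("year of publication of scientific name for taxon", "P574", 4, 0),
    Candidate("described by source", "P1343", 2_271_643, 35),
    Candidate("child", "P40", 1_748_489, 0),
    Candidate("described at URL", "P973", 2_249_411, 29),
    Candidate("describes a project that uses", "P4510", 588_228, 0),
]

ranked = rank("desc", CANDIDATES)
```

With this key, "described by source", "described at URL" and "main subject" come first, and the two taxon properties fall to the bottom, which is the ordering the comment asks for.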

FYI, I was the one who removed the (ostensibly) redundant en-gb label @Nikki spoke of. I find this complication to be emblematic of a broader issue: the need to distinguish regional variants from an ultimately "neutral" variant (which, in English's case, is unfairly assumed to mean en-us). It would be nice if regional variations were defined explicitly, perhaps laid out like this:

figure-1.png (784×1 px, 32 KB)

I'm aware this issue is partially addressed by T303677 (as many of these variations tend to be duplicates of each other), but this approach deals with the issue of locale-sensitive translation in a culturally neutral manner. It also makes variants more visible to editors, who may neglect to update regional spellings when modifying descriptions and labels.

Hopefully this makes sense.

> FYI, I was the one who removed the (ostensibly) redundant en-gb label @Nikki spoke of. I find this complication to be emblematic of a broader issue: the need to distinguish regional variants from an ultimately "neutral" variant (which, in English's case, is unfairly assumed to mean en-us). It would be nice if regional variations were defined explicitly, perhaps laid out like this:
>
> figure-1.png (784×1 px, 32 KB)

That looks almost exactly like what I proposed ages ago in T209068 :)

(There's also an even older ticket about making en more international - T33874)

> I'm aware this issue is partially addressed by T303677 (as many of these variations tend to be duplicates of each other), but this approach deals with the issue of locale-sensitive translation in a culturally neutral manner. It also makes variants more visible to editors, who may neglect to update regional spellings when modifying descriptions and labels.

Removing descriptions should be fine, that doesn't seem to affect the search ranking.