Google not indexing Wikisource properly for years
Open, Needs Triage | Public | BUG REPORT

Assigned To
Authored By
Darwinius
Dec 19 2022, 11:43 PM
Referenced Files
F41651521: Resources on Google Search.pdf
Jan 8 2024, 10:36 AM
F41631008: google-update.png
Dec 24 2023, 4:19 AM
F37132282: duckduckgo.png
Jul 7 2023, 3:45 PM
F37132284: google.png
Jul 7 2023, 3:45 PM
F35959436: agg-Mobile_en.po.txt
Jan 3 2023, 8:42 AM
F35887099: image.png
Dec 22 2022, 10:33 AM
F35887095: image.png
Dec 22 2022, 10:33 AM
F35886522: image.png
Dec 21 2022, 8:44 PM
Tokens
"Love" token, awarded by Effeietsanders. "Cookie" token, awarded by Replayful. "Yellow Medal" token, awarded by Dzahn.

Description

Steps to replicate the issue (include links if applicable):

  • Pick a page created months ago on Wikisource, for instance, https://de.wikisource.org/w/index.php?title=Zedler:Puppenwerck&oldid=3795414 (June 2021)
  • Pick a sentence from the page and search for it on Google.
  • If that works, then try with a different page (e.g. via Special:Random), because often the URLs used to talk about this bug do end up getting indexed and so no longer show the bug in operation.
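The search step above can be sketched in Python. This is only an illustration: the helper name `exact_phrase_query` and the default `site` value are assumptions made for the example, and it simply builds the URL that a manual search for a quoted sentence would produce.

```python
from urllib.parse import urlencode


def exact_phrase_query(sentence: str, site: str = "wikisource.org") -> str:
    """Build a search URL for an exact-phrase query restricted to one site.

    Hypothetical helper: it only constructs the URL, it does not fetch it.
    """
    # Double quotes ask for an exact-phrase match; site: restricts the domain.
    query = f'"{sentence}" site:{site}'
    return "https://www.google.com/search?" + urlencode({"q": query})


# Replace the placeholder with a sentence copied from the page under test.
print(exact_phrase_query("a sentence copied from the page"))
```

Opening the printed URL in a browser and checking whether any Wikisource result appears reproduces the manual test.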

What happens?:

No results on Wikisource appear.

What should have happened instead?:

Should be indexed by Google.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

First detected on ws.pt, then tested on ws.es, ws.he, ws.fr, ws.de, and ws.en with similar results: Google doesn't seem to be indexing new pages from Wikisource. This greatly diminishes the value of the project and makes any partnership based on it practically impossible. We are preparing one with the National Library of Portugal, and other related tasks are under way with a number of other Portuguese archives and repositories.

This one seems to have been an exception, as it appears in the Google search, even if with an outdated version - https://pt.wikisource.org/wiki/Solicita%C3%A7%C3%B5es_do_Bangu%C3%AA/C001/10

Somewhat similar results on Bing.

Can someone have a look at this with the Google Search Console?

Seems to be related to T238090 and https://support.google.com/webmasters/thread/16243149/google-no-longer-indexing-some-wikisource-pages?hl=en

Possibly related to this T318046

Event Timeline

Including SRE as it involves Google Search Console

I don't think SRE is actually the right one, unless it's a Search-Console-access-request.

I think it's Production-Analytics or maybe Fundraising-Analysis; also see T238090#8469500.

@Dzahn Does a request for someone to look into this using the Search Console count as that?

If you are asking for access to the Search Console, please clarify who needs access to what and add the access request tag. That would make it show up in clinic duty. If you are looking for someone to debug an SEO issue, talk to one of the analytics teams. Cheers

I think in this case, it's the latter, at least for now.

The premise is that Wikisource isn't doing well on Google Search: Google takes a fairly long time to index newly created Books/Pages, and it would be great to do a deep dive and debug why that is happening.

Any idea who might be the best person to contact regarding this?

I didn't have an individual name. There is https://phabricator.wikimedia.org/project/members/1163/ but the list is short.

That is basically why I phrased it "one of the analytics teams" at first.

I think the technical answer is "the people who previously asked for access to search console via Phab tickets" and that means maybe https://phabricator.wikimedia.org/project/members/5882/ is the place to look.

From there we can look at the history of closed tickets and get to https://phabricator.wikimedia.org/maniphest/query/F73LiQFG2VAJ/#R

I was going to say maybe try asking https://www.mediawiki.org/wiki/Product_Analytics#Leadership

but then I found this: T302625, which said "I'll handle administration of our search consoles on various search engines and help Foundation staffers get access to these on a need basis, while tracking who has access and for what purposes."

So I would say try @SCherukuwada.

I've mostly focused on search performance and stats for the Wikipedias and haven't had a chance to set up and build an understanding of where we are with Wikisource yet. That's why I don't have an immediate answer for this problem.

I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig right into it. Please expect this to happen sometime in the second or third week of January.

A single URL of a work I recently added to ws.pt, and which I've been working on, appears to have been noticed by Google: https://pt.wikisource.org/w/index.php?title=P%C3%A1gina:Elucidario_Madeirense,_1998,_vol._I.pdf/11&diff=471867&oldid=471865

image.png (302×531 px, 27 KB)

However, it has no description nor any info about it. Clicking the "More info on this problem" link leads me to the Google SC:

image.png (520×861 px, 74 KB)

https://support.google.com/webmasters/answer/7489871?hl=pt

It says Google was unable to describe the page because the Wikisource site is actively blocking it.

I've mostly focused on search performance and stats for the Wikipedias and haven't had a chance to set up and build an understanding of where we are with Wikisource yet. That's why I don't have an immediate answer for this problem.

I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig right into it. Please expect this to happen sometime in the second or third week of January.

Sure that works :)

@Seddon It was probably indexed in the last couple of days, most likely because it appeared in this thread, since many pages created before that one are still not indexed by Google.

Examples:
https://de.wikisource.org/wiki/Zedler:Lehn-Sachen

image.png (233×689 px, 20 KB)

https://de.wikisource.org/wiki/Zedler:Maschine,_(einfache_oder_schlechte)

image.png (273×749 px, 28 KB)

Samhaljml triaged this task as Unbreak Now! priority. Jan 3 2023, 8:42 AM
Samhaljml updated the task description.
Peachey88 lowered the priority of this task from Unbreak Now! to Needs Triage. Jan 3 2023, 8:44 AM
Peachey88 updated the task description.

@SCherukuwada: Hi, any news on this to share? Thanks :)

Apologies for the ridiculous delay. I have Wikisource search console access now and am looking at it.

@SCherukuwada - Additional context on this ticket. This issue has been reported multiple times in Wikisource Telegram group and also in the Wikisource Community meetings. Volunteers have expressed that if the works they are creating are not going to show up in search results, it is not worth contributing their time and energy on the project.

As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki. There are no sitemaps any more, and Special:Lonelypages is uncrawlable because there's a robots.txt rule blocking all Special: pages from being crawled.
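To illustrate the robots.txt point, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt excerpt is an assumption written for the example, not a copy of the live file, and `is_crawlable` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed excerpt for illustration only; the real robots.txt may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the assumed rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://en.wikisource.org/wiki/Main_Page",
    "https://en.wikisource.org/wiki/Special:LonelyPages",
):
    print(url, "crawlable" if is_crawlable(url) else "blocked")
```

Under these assumed rules, any page a crawler could only discover through a Special: listing stays undiscovered unless something else links to it.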

I feel like I'm missing some history here and some history of how we expect these kinds of orphaned/lonely pages to be discovered. I'm going to ask around and get back.

Do we disallow crawling categories? I feel like every page should be in some category, which should allow the bots to find lonely pages.

I assume in Wikipedias, the use of Navboxes somewhat mitigates this issue.

As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki.

On enWS at least, all newly proofread texts are linked for a while on the Main Page under "New Texts" and later on the archive of that list (which is also linked through the Main Page). All texts should also be linked on an associated page in the Author: namespace (listing all the works by that author), and Author: pages should be findable for a crawler through our manually curated alphabetical lists of authors (linked in the sidebar).

There may certainly be texts that fall through these cracks, but some of the major complaints have been regarding texts newly proofread and listed on the main page.

I also think that if good guidance were provided the projects would heed this in designing their practices, if that's what it takes to make our content findable in web search agents.

PS. It occurs to me that to a web crawler a lot of our content will appear in at least two places: in the mainspace transclusion and in the source Page:-namespace page. And we can easily have multiple copies of each book page. You don't suppose GoogleBot has us tagged as an SEO-spam site, or downranks us into oblivion because it sees red flags of that?

@Soda Yeah Navboxes would indeed have helped.

Tell me if this makes sense re: categories. If a page isn't in any category or is in a category but isn't otherwise linked to, how would the crawler know how to find the category? As far as I can tell the mechanism for listing all categories (Special:Categories) is also in the special namespace, which is uncrawlable.
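The reachability question can be made concrete with a small sketch. Everything here is hypothetical: the link graph is invented for illustration, and the crawler is modelled as a plain breadth-first walk from the Main Page that never follows links into the Special: namespace.

```python
from collections import deque


def reachable_pages(links: dict[str, list[str]], start: str,
                    blocked_prefix: str = "Special:") -> set[str]:
    """Breadth-first walk over links, skipping pages under blocked_prefix."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target.startswith(blocked_prefix) or target in seen:
                continue
            seen.add(target)
            queue.append(target)
    return seen


# Invented link graph: "Work 2" is only reachable through its category,
# and the category is only listed on an uncrawlable Special: page.
LINKS = {
    "Main Page": ["Author:A", "Special:Categories"],
    "Author:A": ["Work 1"],
    "Special:Categories": ["Category:Orphans"],
    "Category:Orphans": ["Work 2"],
    "Work 2": ["Category:Orphans"],
}

all_pages = set(LINKS) | {t for targets in LINKS.values() for t in targets}
unreachable = sorted(
    page for page in all_pages - reachable_pages(LINKS, "Main Page")
    if not page.startswith("Special:")
)
print(unreachable)
```

In this toy graph the category and the work inside it are both missed, which is the situation described above: being in a category does not help if nothing crawlable links to the category.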

@Xover That's very useful to know.

Whatever is linked temporarily on Main: might not necessarily be picked up. Do you have a sense of how long "a while" tends to be? The crawl frequency is adaptive, so there might be days between crawls sometimes. As an anecdote, the Main_Page on en.wikisource was last crawled over 24 hours ago as of this writing.

The Authors link is indeed a good lead, and I wasn't even aware of that. So https://en.wikisource.org/wiki/Category:Authors_by_alphabetical_order was last crawled about 6 days ago.

I'm going to look at some unindexed articles and see if they're linked to authors correctly. In the interim if you have more examples of unindexed articles, that would be helpful. Please try to not paste them as complete links so that they remain uncrawled.

As for guidance, once we get to the bottom of why URLs are continuing to remain unindexed we'll simply turn this into guidance that we can then disseminate.

I can't see that we're getting downranked into oblivion - at least not just yet. If we are, that'll be the next problem to solve once we're getting indexed and ranked at all. :-)

Here are some unindexed articles (confirmed from search console and from Google). I came upon them by simply hitting "Zufällige Seite" (Random Page) on de Wikisource and checking if the resulting page is indexed at all.

de.ws /w/index.php?title=Seite:Handbuch_der_Politik_Band_3.pdf/393

de.ws /wiki/Seite:Die_Staats-Vertr%C3%A4ge_des_K%C3%B6nigreichs_Bayern_von_1806_bis_1858.pdf/219
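The random-page sampling could be scripted roughly as follows. This is a sketch under assumptions: it uses the MediaWiki Action API's `list=random` query to pick titles, the helper names are made up for the example, and the sample response is hand-written rather than real output. Whether each sampled page is indexed would still need to be checked separately, in Search Console or by searching.

```python
from urllib.parse import urlencode


def random_pages_url(host: str, limit: int = 5, namespace: int = 0) -> str:
    """Build a MediaWiki API URL that returns `limit` random page titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": namespace,  # 0 is the main namespace
        "rnlimit": limit,
        "format": "json",
    }
    return f"https://{host}/w/api.php?" + urlencode(params)


def titles_from_response(payload: dict) -> list[str]:
    """Extract page titles from a decoded list=random JSON response."""
    return [entry["title"] for entry in payload.get("query", {}).get("random", [])]


# Hand-written sample payload, shaped like a list=random response.
SAMPLE = {"query": {"random": [
    {"id": 1, "ns": 0, "title": "Example page A"},
    {"id": 2, "ns": 0, "title": "Example page B"},
]}}

print(random_pages_url("de.wikisource.org", limit=2))
print(titles_from_response(SAMPLE))
```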

Looking at why none of the pages of this work have been indexed.

I've been through dozens of random articles on the French Wikisource and couldn't find a single unindexed one. @SGill do you have any unindexed ones on any Wikisource that I can examine?

@SCherukuwada I have an interesting one: [[:s:pt:O Movimento Modernista]]

Looking for

"O Movimento Modernista" site:wikisource.org

on google, it shows that the crawler noticed the link on the Main Page of the project and even on the author's page, but did not "enter" the page of the work.

@SCherukuwada, this one on Bengali Wikisource is currently on the Main Page, but when searching on Google, only the index page shows up.

Thank you for the supporting links.

Having discussed this internally with other Foundation staff, there seems to be some fragmented knowledge around how data dumps may be a part of this puzzle that we need to put together to understand what's happening. I'm trying to get to the bottom of that. As soon as I have answers I'll respond here. I expect to hear back from some people in the next couple of days.

Again, thank you all for your patience.

OK, having talked to some folks in the Enterprise org and other teams and having eliminated a few possible problems, the one I'm investigating now is the possibility that for some reason Google's bot is ratelimiting itself. I'll continue to post any findings here.

@Soda Yeah Navboxes would indeed have helped.

Tell me if this makes sense re: categories. If a page isn't in any category or is in a category but isn't otherwise linked to, how would the crawler know how to find the category? As far as I can tell the mechanism for listing all categories (Special:Categories) is also in the special namespace, which is uncrawlable.

In all the examples I've tried, Google clearly has found the category since it includes the category in the search results, but for some reason hasn't indexed any of the pages in the category, while DuckDuckGo finds everything - the category, the file on Commons, the index page, the individual pages, the transcluded version in the main namespace...

An example:
Here's what I get when searching for "Lohengrin (V. Busuttil)" (deliberately not adding any links because I don't want to affect the search results) in Google vs DuckDuckGo:
google.png (454×696 px, 72 KB) duckduckgo.png (620×765 px, 107 KB)
Google returns only two results, one of which highlights the page name that it should have indexed but didn't. DuckDuckGo returns 26 results in total (only the first three are shown), 25 pages on Wikisource plus the file page on Commons.

Yeah, the general pattern I am seeing on the search console mirrors this behaviour. I wonder if we are failing the "Content Quality" or some similar criterion on categories (since a category page is effectively a bunch of links).

Mail threads have been started on this issue at wikimedia-l and wikisource-l mailing lists.

This has been reported to Google. We're waiting for them to get back.

We're meeting with them in the next couple of weeks to troubleshoot our scraping problems. Will report back once we learn more.

We met with Google to discuss this further. Google will provide more details on this soon, but the crux of the matter is that not all pages are guaranteed to be crawled, indexed, and served, as is stated on Search Central documentation.

The article that you linked lists the following:

  • Network issues
  • Too much Javascript
  • Server problems
  • Disallowing via robots.txt

AFAIK, robots.txt isn't the problem here. Is there any way we can debug and see if any of the others are the issue?

The important takeaway from this (as per our discussion) was this bit:

Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials.

They'll share more details shortly. We've given them some URLs to debug on their end to see if the behaviour is intended.

An example:
Here's what I get when searching for "Lohengrin (V. Busuttil)" (deliberately not adding any links because I don't want to affect the search results) in Google vs DuckDuckGo:
google.png (454×696 px, 72 KB) duckduckgo.png (620×765 px, 107 KB)
Google returns only two results, one of which highlights the page name that it should have indexed but didn't. DuckDuckGo returns 26 results in total (only the first three are shown), 25 pages on Wikisource plus the file page on Commons.

I just checked this one again and it seems Google has now found it:

google-update.png (747×708 px, 137 KB)

I checked the other examples I had and it seems to have found at least some of the pages for all of those too.

The important takeaway from this (as per our discussion) was this bit:

Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials.

They'll share more details shortly. We've given them some URLs to debug on their end to see if the behaviour is intended.

@SCherukuwada It's been three months since :) Please let us know if anything was shared.

Here is a summary of our discussions with Google (they proofread this summary):

The web is really large and the search index can simply not include every single page. A page that otherwise has no problems may not be indexed for a myriad of complex reasons, for instance if the indexing process determines that the page is unlikely to be requested in search. This is in line with the Search Central documentation that states: "Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials."

Google also shared a document containing resource links.

They encouraged using the SEO Office Hours hosted by Google, with the disclaimer that they might not be able to answer every question in a given session.

@SCherukuwada I do understand. However, since it is clear that we can't expect much help from Google here, maybe we can still try to identify in-house why the Wikisources are not being crawled?