Google not indexing Wikisource properly for years
Open, Needs Triage | Public | Bug Report

Assigned To

Authored By

	Darwinius
	Dec 19 2022, 11:43 PM

Description

Steps to replicate the issue (include links if applicable):

Pick a page created months ago on Wikisource, for instance, https://de.wikisource.org/w/index.php?title=Zedler:Puppenwerck&oldid=3795414 (June 2021)
Pick a sentence from the page and search for it on Google.
If that works, then try with a different page (e.g. via Special:Random), because often the URLs used to talk about this bug do end up getting indexed and so no longer show the bug in operation.
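The search step above can be sketched as a small helper that builds an exact-phrase query restricted to one site. This is only an illustration: the function name `build_site_search_url` is hypothetical, and the sample sentence is the one used later in this thread for Zedler:Puppenwerck.

```python
from urllib.parse import urlencode


def build_site_search_url(sentence: str, site: str = "wikisource.org") -> str:
    """Return a Google search URL for an exact-phrase query limited to `site`.

    Hypothetical helper, not part of any Wikimedia tooling.
    """
    # Quote the sentence so it is searched as an exact phrase, then
    # restrict the results with the site: operator.
    query = f'"{sentence}" site:{site}'
    return "https://www.google.com/search?" + urlencode({"q": query})


url = build_site_search_url(
    "nennet man überhaupt alles Spielwerck", "de.wikisource.org"
)
print(url)
```

Opening the printed URL in a browser and checking whether the expected page is listed reproduces the manual test.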

What happens?:

No results on Wikisource appear.

What should have happened instead?:

Should be indexed by Google.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Detected first on ws.pt and tested on ws.es, ws.he, ws.fr, ws.de and ws.en with similar results: Google doesn't seem to be indexing new pages from Wikisource. This greatly diminishes the value of the project and makes any partnership based on it basically impossible. We are preparing one with the National Library of Portugal, and other work is under way with a number of other Portuguese archives and repositories.

This one seems to have been an exception, as it appears in the Google search, even if with an outdated version - https://pt.wikisource.org/wiki/Solicita%C3%A7%C3%B5es_do_Bangu%C3%AA/C001/10

Somewhat similar results on Bing.

Can someone have a look at this with the Google Search Console?

Possibly related to T318046.

Related Objects

Mentioned In: T365806: Infinite scroll for articles (split documents on wikisource)
T348203: Google displays “Wikipedia” as site title for some non-Wikipedia pages
T336255: Search Console access request for @Soda for Wikisource
Mentioned Here: T302625: Request Administrator Access to Google Search Console
T318046: CentralNotice severely impacts CLS score
T238090: Search Console access for he.wikisource.org

Event Timeline


Including SRE as it involves Google Search Console

GeneralNotability subscribed.Dec 20 2022, 3:52 PM

Dzahn added a project: Product-Analytics.Dec 20 2022, 7:37 PM

In T325607#8480269, @Albertoleoncio wrote:

Including SRE as it involves Google Search Console

I don't think SRE is actually the right one, unless it's a Search-Console-access-request.

I think it's Product-Analytics or maybe Fundraising-Analysis; also see T238090#8469500.

@Dzahn Does a request for someone to look into this using the Search Console count as that?

If you are asking for access to the Search Console, please clarify who needs access to what and add the access request tag. That would make it show up in clinic duty. If you are looking for someone to debug an SEO issue, talk to one of the analytics teams. Cheers

Base subscribed.Dec 20 2022, 11:04 PM

In T325607#8483180, @Dzahn wrote:

If you are asking for access to the Search Console, please clarify who needs access to what and add the access request tag. That would make it show up in clinic duty. If you are looking for someone to debug an SEO issue, talk to one of the analytics teams. Cheers

I think in this case, it's the latter, at least for now.

The premise is that Wikisource isn't doing well on Google Search: Google takes a fairly long time to index newly created books and pages, and it would be great to dig into why that is happening.

Any idea who might be the best person to contact regarding this?

In T325607#8485221, @Soda wrote:

Any idea who might be the best person to contact regarding this?

I didn't have an individual name. There is https://phabricator.wikimedia.org/project/members/1163/ but the list is short.

That is basically why I phrased it "one of the analytics teams" at first.

I think the technical answer is "the people who previously asked for access to search console via Phab tickets" and that means maybe https://phabricator.wikimedia.org/project/members/5882/ is the place to look.

From there we can look at the history of closed tickets and get to https://phabricator.wikimedia.org/maniphest/query/F73LiQFG2VAJ/#R

I was going to say maybe try asking https://www.mediawiki.org/wiki/Product_Analytics#Leadership

but then I found this: T302625, which said "I'll handle administration of our search consoles on various search engines and help Foundation staffers get access to these on a need basis, while tracking who has access and for what purposes."

So, I would say try @SCherukuwada.

SCherukuwada claimed this task.Dec 21 2022, 8:23 PM

Dzahn awarded a token.Dec 21 2022, 8:24 PM

I've mostly focused on search performance and stats for the Wikipedias and haven't had a chance to set up and build an understanding of where we are with Wikisource yet. That's why I don't have an immediate answer for this problem.

I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig right into it. Please expect this to happen sometime in the second or third week of January.

A single URL of a work I recently added to ws.pt, and which I've been working on, appears to have been noticed by Google: https://pt.wikisource.org/w/index.php?title=P%C3%A1gina:Elucidario_Madeirense,_1998,_vol._I.pdf/11&diff=471867&oldid=471865

However, it has no description nor any info about it. Clicking the "More info on this problem" link leads me to the Google SC:

https://support.google.com/webmasters/answer/7489871?hl=pt

It says Google was unable to describe the page because the Wikisource site is actively blocking it.

In T325607#8485491, @SCherukuwada wrote:

I've mostly focused on search performance and stats for the Wikipedias and haven't had a chance to set up and build an understanding of where we are with Wikisource yet. That's why I don't have an immediate answer for this problem.

I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig right into it. Please expect this to happen sometime in the second or third week of January.

Sure that works :)

Bodhisattwa subscribed.Dec 22 2022, 3:58 AM

Page: https://de.wikisource.org/wiki/Zedler:Puppenwerck
Search: https://www.google.de/search?q=nennet+man+%C3%BCberhaupt+alles+Spielwerck

1st result

PMenon-WMF subscribed.Dec 22 2022, 8:43 AM

@Seddon It was probably indexed in the last couple of days, most likely because it appeared on this thread, since many pages created before that one are still not indexed by Google.

Examples:
https://de.wikisource.org/wiki/Zedler:Lehn-Sachen

https://de.wikisource.org/wiki/Zedler:Maschine,_(einfache_oder_schlechte)

Samhaljml triaged this task as Unbreak Now! priority.Jan 3 2023, 8:42 AM

Samhaljml updated the task description.

Peachey88 lowered the priority of this task from Unbreak Now! to Needs Triage.Jan 3 2023, 8:44 AM

Peachey88 updated the task description.

mpopov moved this task from Triage to Tracking on the Product-Analytics board.Jan 17 2023, 6:14 PM

mpopov added a subscriber: Mayakp.wiki.

Dzahn unsubscribed.Jan 24 2023, 7:21 PM

@SCherukuwada: Hi, any news on this to share? Thanks :)

Seddon unsubscribed.Mar 5 2023, 1:56 PM

Samwilson updated the task description.Mar 14 2023, 2:31 AM

Robertsky subscribed.Mar 23 2023, 2:55 AM

Antanana subscribed.Apr 30 2023, 10:53 AM

OrbiliusMagister subscribed.Apr 30 2023, 5:45 PM

VIGNERON subscribed.May 1 2023, 4:50 PM

Soda mentioned this in T336255: Search Console access request for @Soda for Wikisource.May 9 2023, 11:46 AM

TheresNoTime subscribed.May 11 2023, 1:44 PM

Ijon subscribed.May 11 2023, 2:15 PM

Apologies for the ridiculous delay. I have Wikisource search console access now and am looking at it.

SGill subscribed.May 19 2023, 1:05 PM

Novem_Linguae subscribed.May 21 2023, 1:58 PM

Replayful awarded a token.May 26 2023, 6:59 PM

Replayful subscribed.

@SCherukuwada - Additional context on this ticket. This issue has been reported multiple times in the Wikisource Telegram group and also in the Wikisource Community meetings. Volunteers have expressed that if the works they are creating are not going to show up in search results, it is not worth contributing their time and energy to the project.

As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki. There are no sitemaps any more, and Special:Lonelypages is uncrawlable because there's a robots.txt rule blocking all Special: pages from being crawled.

I feel like I'm missing some history here and some history of how we expect these kinds of orphaned/lonely pages to be discovered. I'm going to ask around and get back.
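As a rough illustration of the robots.txt point above, the following sketch checks URLs against a rule that blocks Special: pages. The rule text in `ASSUMED_ROBOTS_TXT` is an assumption written for this example, not a copy of the live robots.txt, and `is_crawlable` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule for illustration only; the real robots.txt is much longer.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:
"""


def is_crawlable(url: str, robots_txt: str = ASSUMED_ROBOTS_TXT,
                 agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for page in ("Special:Lonelypages", "Main_Page"):
    url = f"https://en.wikisource.org/wiki/{page}"
    print(page, is_crawlable(url))
```

Under this assumed rule, Special:Lonelypages is reported as not crawlable while an ordinary page is, which matches the behaviour described in the comment.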

In T325607#8897400, @SCherukuwada wrote:

As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki. There are no sitemaps any more, and Special:Lonelypages is uncrawlable because there's a robots.txt rule blocking all Special: pages from being crawled.

I feel like I'm missing some history here and some history of how we expect these kinds of orphaned/lonely pages to be discovered. I'm going to ask around and get back.

Do we disallow crawling categories? I feel like every page should be in some category, which should allow the bots to find lonely pages.

I assume in Wikipedias, the use of Navboxes somewhat mitigates this issue.

In T325607#8897400, @SCherukuwada wrote:

As far as I can tell, a lot of the pages that aren't appearing in the index are simply not linked to from within the Wiki.

On enWS at least, all newly proofread texts are linked for a while on the Main Page under "New Texts" and later on the archive of that list (which is also linked through the Main Page). All texts should also be linked on an associated page in the Author: namespace (listing all the works by that author), and Author: pages should be findable for a crawler through our manually curated alphabetical lists of authors (linked in the sidebar).

There may certainly be texts that fall through these cracks, but some of the major complaints have been regarding texts newly proofread and listed on the main page.

I also think that if good guidance were provided the projects would heed this in designing their practices, if that's what it takes to make our content findable in web search agents.

PS. It occurs to me that to a web crawler a lot of our content will appear in at least two places: in the mainspace transclusion and in the source Page:-namespace page. And we can easily have multiple copies of each book page. You don't suppose GoogleBot has us tagged as an SEO-spam site, or downranks us into oblivion because it sees red flags of that?

@Soda Yeah Navboxes would indeed have helped.

Tell me if this makes sense re: categories. If a page isn't in any category or is in a category but isn't otherwise linked to, how would the crawler know how to find the category? As far as I can tell the mechanism for listing all categories (Special:Categories) is also in the special namespace, which is uncrawlable.
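The discovery question can be sketched as a breadth-first walk over a link graph in which Special: pages are never followed. The graph `TOY_LINKS` and its page names are made up for this example and do not describe any real wiki; `reachable_pages` is a hypothetical helper.

```python
from collections import deque


def reachable_pages(links, start, blocked_prefix="Special:"):
    """Return the set of pages a crawler can reach from `start`
    when it refuses to visit pages whose title starts with `blocked_prefix`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Skip blocked pages and pages that were already discovered.
            if target.startswith(blocked_prefix) or target in seen:
                continue
            seen.add(target)
            queue.append(target)
    return seen


# Toy graph (assumption): Work_2 is only reachable through a category that
# is itself only linked from the blocked Special:Categories listing.
TOY_LINKS = {
    "Main_Page": ["Author:A", "Special:Categories"],
    "Author:A": ["Work_1"],
    "Special:Categories": ["Category:Orphans"],
    "Category:Orphans": ["Work_2"],
}

found = reachable_pages(TOY_LINKS, "Main_Page")
print(sorted(found))
```

In this toy setup the crawler finds Work_1 through the author page but never reaches Category:Orphans or Work_2, which is the situation the question describes.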

@Xover That's very useful to know.

Whatever is linked temporarily on Main: might not necessarily be picked up. Do you have a sense of how long "a while" tends to be? The crawl frequency is adaptive, so there might be days between crawls sometimes. As an anecdote, the Main_Page on en.wikisource was last crawled over 24 hours ago as of this writing.

The Authors link is indeed a good lead, and I wasn't even aware of that. So https://en.wikisource.org/wiki/Category:Authors_by_alphabetical_order was last crawled about 6 days ago.

I'm going to look at some unindexed articles and see if they're linked to authors correctly. In the interim if you have more examples of unindexed articles, that would be helpful. Please try to not paste them as complete links so that they remain uncrawled.

As for guidance, once we get to the bottom of why URLs are continuing to remain unindexed we'll simply turn this into guidance that we can then disseminate.

I can't see that we're getting downranked into oblivion - at least not just yet. If we are, that'll be the next problem to solve once we're getting indexed and ranked at all. :-)

Here are some unindexed articles (confirmed from search console and from Google). I came upon them by simply hitting "Zufällige Seite" (Random Page) on de Wikisource and checking if the resulting page is indexed at all.

de.ws /w/index.php?title=Seite:Handbuch_der_Politik_Band_3.pdf/393

de.ws /wiki/Seite:Die_Staats-Vertr%C3%A4ge_des_K%C3%B6nigreichs_Bayern_von_1806_bis_1858.pdf/219

Looking at why none of the pages of this work have been indexed.

I've been through dozens of random articles on the French Wikisource and couldn't find a single unindexed one. @SGill do you have any unindexed ones on any Wikisource that I can examine?

@SCherukuwada I have an interesting one: [[:s:pt:O Movimento Modernista]]

Looking for

"O Movimento Modernista" site:wikisource.org

on google, it shows that the crawler noticed the link on the Main Page of the project and even on the author's page, but did not "enter" the page of the work.

@SCherukuwada, this one on Bengali Wikisource is currently on the Main Page, but when searching on Google, only the index page shows up.

Thank you for the supporting links.

Having discussed this internally with other Foundation staff, there seems to be some fragmented knowledge around how data dumps may be a part of this puzzle that we need to put together to understand what's happening. I'm trying to get to the bottom of that. As soon as I have answers I'll respond here. I expect to hear back from some people in the next couple of days.

Again, thank you all for your patience.

OK, having talked to some folks in the Enterprise org and other teams and having eliminated a few possible problems, the one I'm investigating now is the possibility that for some reason Google's bot is rate-limiting itself. I'll continue to post any findings here.

In T325607#8898528, @SCherukuwada wrote:

@Soda Yeah Navboxes would indeed have helped.

Tell me if this makes sense re: categories. If a page isn't in any category or is in a category but isn't otherwise linked to, how would the crawler know how to find the category? As far as I can tell the mechanism for listing all categories (Special:Categories) is also in the special namespace, which is uncrawlable.

In all the examples I've tried, Google clearly has found the category since it includes the category in the search results, but for some reason hasn't indexed any of the pages in the category, while DuckDuckGo finds everything - the category, the file on Commons, the index page, the individual pages, the transcluded version in the main namespace...

An example:
Here's what I get when searching for "Lohengrin (V. Busuttil)" (deliberately not adding any links because I don't want to affect the search results) in Google vs DuckDuckGo:

Google returns only two results, one of which highlights the page name that it should have indexed but didn't. DuckDuckGo returns 26 results in total (only the first three are shown), 25 pages on Wikisource plus the file page on Commons.

In T325607#8997709, @Nikki wrote:

In T325607#8898528, @SCherukuwada wrote:

@Soda Yeah Navboxes would indeed have helped.

Tell me if this makes sense re: categories. If a page isn't in any category or is in a category but isn't otherwise linked to, how would the crawler know how to find the category? As far as I can tell the mechanism for listing all categories (Special:Categories) is also in the special namespace, which is uncrawlable.

In all the examples I've tried, Google clearly has found the category since it includes the category in the search results, but for some reason hasn't indexed any of the pages in the category, while DuckDuckGo finds everything - the category, the file on Commons, the index page, the individual pages, the transcluded version in the main namespace...

An example:
Here's what I get when searching for "Lohengrin (V. Busuttil)" (deliberately not adding any links because I don't want to affect the search results) in Google vs DuckDuckGo:

Google returns only two results, one of which highlights the page name that it should have indexed but didn't. DuckDuckGo returns 26 results in total (only the first three are shown), 25 pages on Wikisource plus the file page on Commons.

Yeah, the general pattern I am seeing on the search console mirrors this behaviour. I wonder if we are failing "Content Quality" or some similar criterion on categories (since a category page is effectively a bunch of links).

mrephabricator subscribed.Jul 13 2023, 5:00 PM

Mike_Peel subscribed.Aug 1 2023, 6:58 AM

CptViraj subscribed.Aug 1 2023, 5:21 PM

Effeietsanders awarded a token.Aug 2 2023, 10:02 PM

Effeietsanders subscribed.

Mail threads have been started on this issue at wikimedia-l and wikisource-l mailing lists.

Nicholas_Perry subscribed.Aug 7 2023, 11:05 AM

This has been reported to Google. We're waiting for them to get back.

Samwalton9-WMF subscribed.Aug 15 2023, 4:17 AM

Vituzzu subscribed.Aug 23 2023, 9:34 AM

Niharika subscribed.Aug 23 2023, 4:32 PM

We're meeting with them in the next couple of weeks to troubleshoot our scraping problems. Will report back once we learn more.

Nehaoua subscribed.Aug 26 2023, 10:46 AM

Sanqui subscribed.Sep 15 2023, 9:00 PM

Celenduin subscribed.Sep 16 2023, 11:13 AM

We met with Google to discuss this further. Google will provide more details on this soon, but the crux of the matter is that not all pages are guaranteed to be crawled, indexed, and served, as is stated on Search Central documentation.

In T325607#9214743, @SCherukuwada wrote:

We met with Google to discuss this further. Google will provide more details on this soon, but the crux of the matter is that not all pages are guaranteed to be crawled, indexed, and served, as is stated on Search Central documentation.

The article that you linked lists the following:

Network issues
Too much JavaScript
Server problems
Disallowing via robots.txt

AFAIK, robots.txt isn't the problem here. Is there any way we can debug and see if any of the others are the issue?

The important takeaway from this (as per our discussion) was this bit:

Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials.

They'll share more details shortly. We've given them some URLs to debug on their end to see if the behaviour is intended.

R4356th mentioned this in T348203: Google displays “Wikipedia” as site title for some non-Wikipedia pages.Oct 22 2023, 7:30 PM

Tahmid subscribed.Nov 23 2023, 2:29 PM

In T325607#8997709, @Nikki wrote:

An example:
Here's what I get when searching for "Lohengrin (V. Busuttil)" (deliberately not adding any links because I don't want to affect the search results) in Google vs DuckDuckGo:

Google returns only two results, one of which highlights the page name that it should have indexed but didn't. DuckDuckGo returns 26 results in total (only the first three are shown), 25 pages on Wikisource plus the file page on Commons.

I just checked this one again and it seems Google has now found it:

I checked the other examples I had and it seems to have found at least some of the pages for all of those too.

In T325607#9218609, @SCherukuwada wrote:

The important takeaway from this (as per our discussion) was this bit:

Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials.

They'll share more details shortly. We've given them some URLs to debug on their end to see if the behaviour is intended.

@SCherukuwada It's been three months since :) Please let us know if anything was shared.

Here is a summary of our discussions with Google (they proofread this summary):

The web is really large and the search index can simply not include every single page. A page that otherwise has no problems may not be indexed for a myriad of complex reasons, for instance if the indexing process determines that the page is unlikely to be requested in search. This is in line with the Search Central documentation that states: "Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials."

Google also shared a document containing resource links.

They encouraged using SEO Office Hours hosted by Google. And it comes with a disclaimer saying that they might not be able to answer all questions in a given instance.

Resources on Google Search.pdf (33 KB)

Yann subscribed.Jan 19 2024, 7:07 PM

Frostly subscribed.Jan 21 2024, 10:28 PM

hubaishan subscribed.Jan 22 2024, 12:09 PM

In T325607#9440813, @SCherukuwada wrote:

Here is a summary of our discussions with Google (they proofread this summary):

The web is really large and the search index can simply not include every single page. A page that otherwise has no problems may not be indexed for a myriad of complex reasons, for instance if the indexing process determines that the page is unlikely to be requested in search. This is in line with the Search Central documentation that states: "Google doesn't guarantee that it will crawl, index, or serve your page, even if your page follows the Google Search Essentials."

Google also shared a document containing resource links.

They encouraged using SEO Office Hours hosted by Google. And it comes with a disclaimer saying that they might not be able to answer all questions in a given instance.

Resources on Google Search.pdf (33 KB)

@SCherukuwada I do understand. However, since it is clear that there isn't much help we can expect from Google here, maybe we can still try to identify in-house why Wikisources are not being crawled?

Candalua subscribed.Feb 19 2024, 1:19 PM

Soda mentioned this in T365806: Infinite scroll for articles (split documents on wikisource).May 24 2024, 2:27 PM

This article might be relevant

Prototyperspective subscribed.Jul 31 2024, 9:01 PM

	F41651521: Resources on Google Search.pdf
	Jan 8 2024, 10:36 AM

	F41631008: google-update.png
	Dec 24 2023, 4:19 AM

	F37132282: duckduckgo.png
	Jul 7 2023, 3:45 PM

	F37132284: google.png
	Jul 7 2023, 3:45 PM

	F35887099: image.png
	Dec 22 2022, 10:33 AM
