Dev Ups

Published 2022-10-26 in webdev

Disappearing from Google search after changing robots.txt

Drop off in hits from Google since changing robots.txt

I monitor my website's metrics with Google Search Console (GSC).

On the 23rd of October, the performance graph showed data up to the 21st. I noticed my site's impressions had plummeted to near zero. I'm a patient man, but after 3 days I set out to discover why.

It takes 20 years to build a reputation and five minutes to ruin it.

-- Warren Buffett

Symptoms

I modified robots.txt 4 days before my impressions tanked. I had forgotten all about it.

I suspected my sitemap. GSC has a direct link to the sitemaps of any property (domain) you own.

I noticed it had a status of "failed" and was last read on the 15th of October. Thinking myself pretty canny, I submitted the URL again. It failed, stating, "Couldn't fetch".

"Couldn't fetch" led me to this GSC help page which led to another and advice to try the "live test".

Google could not access my sitemap from their IP, but I still could. I worried that Fail2ban might have banned them; it had not. Therefore, I must have banned them myself, in robots.txt.
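
Checking that it was reachable from my side took one request. A minimal sketch, fetching the sitemap with a Googlebot-style User-Agent; success only proves my own IP isn't blocked, which is exactly why Fail2ban still had to be ruled out:

import urllib.request

# Fetch the sitemap announcing ourselves the way a crawler would.
# A 200 here proves reachability from this machine's IP, not Google's.
req = urllib.request.Request(
    "https://silverbullets.co.uk/sitemap.xml",
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))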

robots.txt

I can't guarantee all bots will follow this standard, but I mostly care about Google, so I re-read it when I discovered my robots.txt was the issue.

I didn't see a problem. Only by using GSC's interactive robots.txt testing tool was the error made clear to me:

robots.txt interactive test

The page displays the date of the current (and previous) revision of robots.txt. Mine said the 18th, but I don't change it often, so all I needed to do was inspect the git history.

I removed Disallow: /*/$ from User-agent: * on the 17th, because GSC complained about some old, trailing-slash URLs being off limits. Foolishly, I had expected the newline and whatever came next, anything, even the sitemap line, to close the scope on the preceding user agents.
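
Here, roughly, is what the file looked like after that edit (bot list trimmed, as in the fixed version further down):

User-agent: *

User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

sitemap: https://silverbullets.co.uk/sitemap.xml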

That isn't how robots.txt works. A User-agent is only satisfied once it has found at least one Allow: or Disallow:; until then its group stays open, so my wildcard stacked onto the user agents below it and shared their Disallow: /, banning GoogleBot along with the rest. The right-hand side doesn't matter. I tried both, but settled on an empty Disallow:, not wanting to rely on the newer, less widely supported Allow: directive.

robots.txt interactive test

Enforcement

Many bots don't consider themselves bound by the wildcard User-agent. For example, my logs were still full of SemrushBot requests while GoogleBot was abiding.
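
A rough way to see who is ignoring the rules is to tally bot hits in the access log. A sketch, assuming nginx's default log path and format; the bot list and path are mine to adjust:

from collections import Counter

# Each of these crawlers names itself in its User-Agent string,
# which nginx records in the default combined log format.
BOTS = ("Googlebot", "AhrefsBot", "SemrushBot", "bingbot")

hits = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        for bot in BOTS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{count:6d}  {bot}")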

Openly malicious bots don't care, but legitimate businesses follow these rules, especially if they are explicitly named.

Here, for example, is how I keep mine short (other bot names removed for brevity):

User-agent: *
Disallow: 

User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

sitemap: https://silverbullets.co.uk/sitemap.xml
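
Python's standard library can sanity-check these rules, though its parser has quirks of its own (it would not reproduce the dangling-group behaviour that bit me), so GSC's interactive tester remains the authority. A sketch:

from urllib import robotparser

RULES = """\
User-agent: *
Disallow:

User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

sitemap: https://silverbullets.co.uk/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("Googlebot", "https://silverbullets.co.uk/"))   # True
print(rp.can_fetch("SemrushBot", "https://silverbullets.co.uk/"))  # False
print(rp.site_maps())  # Python 3.8+: ['https://silverbullets.co.uk/sitemap.xml']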

Conclusion

A fixed robots.txt can be verified in GSC's interactive robots.txt testing tool; that is the only immediate feedback possible. GoogleBot leaves a gap of hours between re-reads of robots.txt; it requests mine twice per day.
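
That twice-a-day cadence comes straight out of the access log. Same nginx log-path assumption as earlier:

# Print every robots.txt fetch by Googlebot, timestamps included.
with open("/var/log/nginx/access.log") as log:
    for line in log:
        if "Googlebot" in line and "GET /robots.txt" in line:
            print(line.rstrip())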

Drop off in hits from Google since changing robots.txt

The graph shows my impressions continued to grow after I updated robots.txt late on the 17th of October. From the 19th, impressions fell off and no one was clicking. I don't have enough traffic (or confidence in GSC) to be sure that wasn't just an off day. From the 21st, it appears Google didn't even want to show my results in queries.

I put this to the test on the 25th, before I attempted my fix. I searched for my top-ranking queries, for example, "raspberry pi imager error writing to storage". I wasn't in the first 100 results, nor was I for my other top queries. According to GSC, my average position site-wide is 30.

Curiously, I could find a bunch of pages when I asked for just my site, "site:silverbullets.co.uk", but it appeared that I did not exist to disinterested searchers. When I filtered the results to just the last month, I saw one entry, from the 3rd of October.

I read that search metrics appear to domain owners after two days. I don't know what timezone Google is using, and I had never cared before, so long as it was consistent.

2nd November: update

The above was written on the 26th of October. Seven days later, I'm worrying.

I got stats for the 31st. I'm still flatlining. I still have only one article dated within the past month, the one from the 3rd of October.

No organic traffic from Google since

On October 31st, I searched my leading term verbatim, "raspberry pi imager error writing to storage while fsync". I'm not listed anywhere in the 90-100 results. I'm tempted to blame the uptick seen in the graph for the 31st on this searching, but my searches never returned my site, so they can't have earned it any impressions. It would be so helpful to be able to exclude my own IP from these results.

I shouldn't expect recovery to be as fast as decline. I blocked GoogleBot the night of the 17th. I didn't bottom out until the 21st. Fixed on the 26th.

I managed to upload 3 new posts last night/this morning. I can see that after a request for /feed at 00:08 this morning, GoogleBot requested all 3 new posts. It still hasn't indexed them, but at least I knew what to look for and spotted it. I'd posted this page on the 26th though; surely, in the interests of timeliness, that should be indexed by now.

I checked pages -> page indexing -> indexed pages. It's sorted chronologically, so I can see where activity fell off when I blocked GoogleBot on the 18th:

URL Last crawled
https://silverbullets.co.uk/webdev/gdpr-compliant-nginx-logs 18 Oct 2022
https://silverbullets.co.uk/ci-cd/cleaning-up-after-uninstalling-docker-in-ubuntu 17 Oct 2022
https://silverbullets.co.uk/raspberry-pi/create-a-headless-raspberry-pi-with-minimal-peripherals 17 Oct 2022
https://silverbullets.co.uk/tag/dns 17 Oct 2022

I turn my attention to pages -> page indexing -> Crawled - currently not indexed. It says:

Validation: Failed. Started: 01/10/2022. Failed: 08/10/2022.

Another table sorted by last-crawled date shows no activity since 16 Oct 2022. It's about time I gave up on Google doing me any favours unbidden. At 11am I clicked "START NEW VALIDATION", despite still having 3 validations pending (Started) since the start of October:

Possible excuses for not indexing my pages.

I Googled this problem and found Why is my page missing from Google Search?:

  1. Search Google for your site or page:

  2. For a missing page: Do a site search with the syntax site:url_of_page

    Examples: site:example.com/petstore/hamsters or site:https://example.com/petstore/hamsters

Further down:

The total time can be anywhere from a day or two to a few weeks, typically, depending on many factors.

My latest post to have been indexed by search was published on the 13th of October, 20 days ago. I was used to a latency of no more than a couple of days when everything worked smoothly. My /feed path might be distracting GoogleBot from the perfectly good sitemap.xml I've curated for it. The bot hasn't requested my sitemap for days.
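
Rather than wait on the bot's own schedule, Google also accepts a sitemap "ping" asking it to re-fetch the file; resubmitting in GSC's Sitemaps page does the same through the UI. A sketch:

import urllib.parse
import urllib.request

# Ask Google to re-fetch the sitemap. A 200 response means the ping
# was received, not that anything has been indexed.
SITEMAP = "https://silverbullets.co.uk/sitemap.xml"
PING = "https://www.google.com/ping?sitemap=" + urllib.parse.quote(SITEMAP, safe="")
with urllib.request.urlopen(PING) as resp:
    print(resp.status)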

All my recent pages are described as:

Page is not indexed: URL is unknown to Google