to dirprocess: bots - info rev 28 jun 2021
Category: websites
.......................................................
summaries:
See info about bad crawling bots in process-web-bots-info.html
How to block them in the robots.txt file: process-web-bots-block_with_robots_txt.html
However, since they are by definition 'bad', some bots ignore the robots.txt file.
They can be blocked in the .htaccess file.
"optimize good ones by altering robots.txt;
block bad ones by IP address in .htaccess."
* What is bot traffic, and how to stop?
https://www.cloudflare.com/learning/bots/what-is-a-bot/
https://www.cloudflare.com/learning/bots/what-is-bot-traffic/
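A quick way to check what a robots.txt file asks of a given bot is Python's standard urllib.robotparser. This is a minimal sketch only: the robots.txt content below is a made-up example, not the file from any real site, and it only shows what a well-behaved bot would do. Bad bots can ignore all of it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration):
# shut out one crawler completely, ask everyone else to slow down.
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Would a rule-following bot with this user agent fetch this path?
print(rules.can_fetch("SemrushBot", "/index.html"))
print(rules.can_fetch("facebookexternalhit", "/index.html"))

# Delay (in seconds) requested of bots that fall under the "*" group.
print(rules.crawl_delay("facebookexternalhit"))
```

With these example rules, SemrushBot is refused everywhere, while facebookexternalhit falls under the catch-all group and is only asked to wait between requests.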
.......................................................
Bad crawling bots:
They are analytics aggregators; their data is mostly useful to
the people/companies who subscribe to them.
Their request volume eats too much server resources and bandwidth.
If not blocked from your website, these bots tend to
obey the crawl-delay directive in robots.txt.
More than half of web traffic comes from robots, not from real users.
[21 jun 2018]
Blocking them gets you
less spam
safer website
less stolen content
lower bandwidth
Can block in robots.txt, by user agent string - but bots can ignore it.
Can block at server level, in .htaccess file.
Can block with Google Analytics. (?)
Can block with proxy like Cloudflare.
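For the server-level option, one common Apache approach is a mod_rewrite rule that returns 403 to matching user agents. The sketch below only generates the text of such a rule from a list of names; it assumes Apache with mod_rewrite enabled, it matches by user agent rather than by IP address, and "ExampleBadBot" is a placeholder name, not a real bot. Test the output on a non-production site before relying on it.

```python
import re


def htaccess_block(user_agents):
    """Return mod_rewrite lines that answer 403 to matching user agents."""
    # Escape regex metacharacters so each name is matched literally.
    pattern = "|".join(re.escape(ua) for ua in user_agents)
    return "\n".join([
        "RewriteEngine On",
        # [NC] makes the match on the User-Agent header case-insensitive.
        f'RewriteCond %{{HTTP_USER_AGENT}} "({pattern})" [NC]',
        # "-" means no rewrite; [F] sends 403 Forbidden; [L] stops processing.
        "RewriteRule .* - [F,L]",
    ])


# SemrushBot comes from these notes; ExampleBadBot is hypothetical.
print(htaccess_block(["SemrushBot", "ExampleBadBot"]))
```

The match is a substring match, so short names need care: a name like "oBot" matched case-insensitively would also catch any user agent containing "robot".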
.......................................................
sources:
* How to block bad website bots and spiders With .htaccess tweaks
https://www.seoblog.com/block-bots-spiders-htaccess/
mar 2018
best one -
explains about the different kinds of bots.
blocking with robots.txt
blocking with .htaccess (on apache)
examples of different syntax.
explanations of the syntax!
- Pre-made lists:
https://www.robotstxt.org/db.html
http://www.botsvsbrowsers.com/
Trouble is these are all old. The bots change. But they give an idea anyway:
https://pastebin.com/5Hw9KZnW
jun 2012
https://tab-studio.com/en/blocking-robots-on-your-page/
dec 2017
https://stackoverflow.com/questions/27431228/how-to-block-bad-bots-in-htaccess
2014, 2018
- How to Identify Robots with Apache Logs
https://www.sumologic.com/insight/apache-logs-identifying-robots/
may 2019
(i've just been checking my awstats and getting user agent strings from there)
https://simtechdev.com/blog/good-and-bad-bots-to-control-to-save-server-resources-and-improve-performance/
Explains about the bots.
Lists user-agent string and summary of bad and good bots.
* List of 1800 bad bots
and very good info.
https://tab-studio.com/en/blocking-robots-on-your-page/
[2017]
List last updated 2017; latest comments dec 2018, promising
an updated list. nothing since then.
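Since the pre-made lists go stale, pulling user agent strings from your own logs (as with the awstats check noted above) is the more reliable route. A minimal sketch of that idea, assuming the Apache "combined" log format where the user agent is the last quoted field; the sample lines, IP addresses, and "ExampleBadBot" are invented for illustration:

```python
import re
from collections import Counter

# In the combined log format, the last two quoted fields are
# the referer and the user agent.
LINE_RE = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user agent string; skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


# Made-up sample data, not real traffic.
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Jun/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBadBot/1.0"',
    '192.0.2.10 - - [28/Jun/2021:10:00:01 +0000] "GET /a HTTP/1.1" 200 734 "-" "ExampleBadBot/1.0"',
    '198.51.100.7 - - [28/Jun/2021:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "facebookexternalhit/1.1"',
]

agent_counts = count_user_agents(SAMPLE_LOG)
for agent, hits in agent_counts.most_common():
    print(hits, agent)
```

On a real site, read the lines from the access log file instead of SAMPLE_LOG; the heaviest user agents at the top of the list are the candidates to look up and possibly block.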
.........................
bad bot problems:
* Bot Attacks: You are not alone…
apr 2021
https://medium.com/expedia-group-tech/bot-attacks-you-are-not-alone-d8b3290342bd
* What is bot management? | How bot managers work
https://www.cloudflare.com/en-gb/learning/bots/what-is-bot-management/
.........................
specific bots:
* Facebook Crawler
Crawls the HTML of an app or website that was shared on Facebook
via copying and pasting the link or by a Facebook social plugin.
The crawler gathers, caches, and displays information about
the app or website such as its title, description, and thumbnail image.
You may want to allow the facebook crawler if your site is active on fb.
User agent strings are
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1
TibetSun is getting one called just 'Facebook'.
https://developers.facebook.com/docs/sharing/webmasters/crawler
* oBot
"oBot is the web crawling bot of the Content Security Division of
IBM Germany Research & Development GmbH. ... [results in a]
database that is made available to our customers in several content
filtering products."
https://www.reddit.com/r/bigseo/comments/jigeg5/obot_do_you_know_this_bot/
Original info at http://filterdb.iss.net/crawler/
but it has bad cert; didn't open it. [31 mar 2021]
* SEMrushbot
- https://dmjcomputerservices.com/blog/blocking-semrushbot-from-website/
_______________________________________________________
begin 28 jun 2021
-- 0 --