Eat Sand, Crawlers

The site is getting absolutely hammered by bots. Arduboy has been growing in popularity in Russia. Coincidence? Maybe. I think someone is scraping the Arduboy forum for AI training.

Like, get bent ya hoser. This is costing me thousands in annual hosting fees.

The forum has features to mitigate this, but they are woefully misguided in their design. Ivory tower, no dog food eating nonsense. They have a white list, but that will block any user agent not on it. This means search engines in smaller countries that I don’t know about would not index the site. Maybe that is the arbitrage here.

There is also a rate limiting feature, which would work perfectly in this situation, but it is only applied to user agents that are specifically targeted for rate limiting. You cannot apply rate limiting to “new” user agents that you have not yet captured.

The easy fix here would be to just have an option to rate limit user agents that are not on a white list.

Even better, and I am scratching my head and starting to get a little pissed as to why this isn’t a feature:

Why is there no feature to block or rate limit a user agent after it hits a certain number of views in a certain time??? Like: Wow I’ve never seen this user agent before and it’s generated 100x as much traffic as a normal user, maybe I’ll put them in timeout.

This really is a weak link in Discourse forums right now. I doubt I’m the only one dealing with this problem.

Does anyone have any suggestions here? At this point I think I will just have to create a very inclusive white list and just deal with the sad reality I’m forcing people to use the major search engines to find out about Arduboy.

Maybe that’s not a big deal? Am I making a bigger deal out of it than it needs to be?

I mean, Discourse is free, open source and can be self hosted. So I keep running into situations where it would be in my best interest to host Discourse myself. It sort of seems like their support system is designed to push customers in this direction instead of actually addressing their needs. I.e. Discourse hosting seems only interested in taking money from low hanging fruit.

/rant

What do those numbers represent? connections? bandwidth?

I wonder if there is a discourse setting to only allow media (downloads) for signed in users.

As for rate and connection limiting I think you need to do that at a server level but I don’t know Discourse.

If the hosting costs so much, maybe look into alternatives or run Discourse on your own VPS?

I think @zep got hit a week or two back and at first thought it was a DDoS.
Many comments about TikTok’s bot not respecting robots.txt etc.

Edit: Found post here.

Those are page views, and they charge against page views.

Yes, more advanced controls would help, and if I was hosting it myself I might be able to do more about it.

Mostly I’m frustrated because I have asked for support twice before and they were of absolutely no help at all, just sending me a canned response to check their blog post on the subject.

They just came back and said they were aware of this happening more and are discussing it internally.

I suggested my idea of an option to rate limit and not block crawlers not on the white list. I also suggested a community managed, or discourse suggested white list so admins don’t have to play whackamole.

Discourse is open source, so maybe browsing through their repo will lead to some info.

If those crawlers obey robots.txt, you could try putting a robots.txt in the root of your webserver with the following content and see if the crawler bots’ page views drop.

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

It only allows Googlebot (and those that don’t obey the rules).
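If you want to double check what those rules mean before relying on them, Python’s standard library can parse them. This is just a quick sketch: the forum URL is a placeholder, and `may_fetch` is a helper name I made up. It only tells you what a well-behaved crawler would conclude, not what a rude one will actually do.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url="https://forum.example/latest"):
    """Placeholder URL; returns what a rule-abiding crawler should decide."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("Googlebot"))   # True
print(may_fetch("ByteSpider"))  # False
```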

I’m using their hosting, these are the only settings I get:

image

image

I have to keep checking this dashboard and sometimes I wake up to find really big surprises and have to add some new user agent to the block list.

But this report has like a hundred legitimate crawlers, so do I sit here for 30 minutes and manually type in each one of them to the white list? Even as a nuclear option that takes forever to do.

I’m not familiar with web hosting but I can’t imagine not having some access to webserver files (how do you backup and restore your webserver for example)

BTW, the robots.txt file on your server looks like this; it probably reflects the changes you can make on that control panel.

Oh and the community has this one.

It isn’t a webserver in the traditional sense. The discourse instance is hosted on a shared web server. The subdomain is directed at that endpoint and their hosting package does not provide any administration of the actual file server or web host. The only controls are within the server web software itself.

For the main website, it is a similar situation. Except the platform is squarespace.

I do not actually have root access to a web server anywhere except old.arduboy.com; all of the current web services are hosted and managed by other service providers.

Sounds like you might not be familiar with how they are hosted, but it’s basically little more than a GeoCities site, I’m just pointing my domain at it.

Yes, the robots.txt files you have linked to are generated by the services that I pay for. I don’t have control over those or even access to them. They are created as a result of configurations that I set within the applications.

And robots.txt does precious little over this type of behavior if the crawlers don’t respect that file, which many of these would not anyways.

Yeah, ByteSpider is a pretty nasty one. I saw this thing a while ago and perhaps it could be helpful.

“test-bot” and “my-tiny-bot” are kind of funny to me because they make it sound like creating an A.I. trainer or whatever is really easy and some random person can do it, but I would’ve thought it was a daunting task.

Time to kick Yandex?

Not that I have anything against Russians buying Arduboys, and correlation isn’t necessarily causation, but if there is a link then perhaps it’s worth the hit.

I’ve visited the Discourse forum precisely once, a few years ago, and had pretty much the same experience - got brushed off, topic locked, and then when someone actually tried to help me in a PM the PM was locked off too. Not a good look for Atwood, and a far cry from the cooperativeness of this forum.

I wonder if the money you’d save by being able to block crawlers would be enough to justify using a different host, and whether you could use that fact to ‘persuade’ them to hurry up and fix the problem.

Incidentally that fidget-spinner-bot has been visiting the forum for years. I haven’t dug around the admin panel in quite a while but I distinctly remember it being top even back then.

You could try contacting some other Discourse forum hosts and asking if they’re interested in pooling intelligence to produce a group-maintained whitelist.

If you had greater control over the forum’s backend code, you could possibly build a whitelist by taking note of which crawlers actually request the robots.txt since a bot that isn’t planning to pay attention to it probably wouldn’t bother fetching it.

(Better still, you might even be able to detect which bots actually do respect what the robots.txt says.)
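As a rough sketch of that idea in Python: assume you could get at a time-ordered list of (user agent, path) pairs from the request logs, which the hosted plan apparently does not expose. The function name, the category names, and the sample bot names are all invented for illustration.

```python
def classify_crawlers(requests, disallowed_prefixes=("/",)):
    """Sort user agents by how they treat robots.txt.

    requests: iterable of (user_agent, path) pairs in time order.
    disallowed_prefixes: path prefixes robots.txt forbids (assumed known).
    """
    seen = set()
    fetched_robots = set()
    ignored_rules = set()
    for user_agent, path in requests:
        seen.add(user_agent)
        if path == "/robots.txt":
            fetched_robots.add(user_agent)
        elif user_agent in fetched_robots and path.startswith(disallowed_prefixes):
            # Read the rules, then requested a forbidden page anyway.
            ignored_rules.add(user_agent)
    return {
        "respectful": fetched_robots - ignored_rules,
        "asked_then_ignored": ignored_rules,
        "never_asked": seen - fetched_robots,
    }


sample = [
    ("polite-bot", "/robots.txt"),
    ("sneaky-bot", "/robots.txt"),
    ("sneaky-bot", "/t/some-topic/123"),
    ("rude-bot", "/t/some-topic/123"),
]
report = classify_crawlers(sample)
print(report)
```

Only the "respectful" group would be candidates for a whitelist. Note that "never_asked" also catches ordinary browsers, so real use would need to separate human traffic first.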


Open source AI software is making it a lot easier.
E.g. TensorFlow has been used pretty openly for the last few years.

The prohibitive factor these days is the hardware rather than the software since you need a powerful computer to crunch large quantities of training data.

Aside from which, there’s a lot of clever programmers floating around the internet.

I’d be curious to know how discourse does their administration. They have their forum there and I don’t know how many of the “mods” are on a payroll? It seems like they are often competing to provide responses.

It seems like there are only a small number, maybe a couple, of “actual” employees or “admins” who have any control?

Like I wonder if it’s something similar to here where it’s just members of the community jumping in to provide some support. I think it’s managed more than that, but I’m often getting replies from people who seem like they’ve only been there for a couple days and are tasked with playing whackamole.

Russia is also at the top of new visits in the repo :dizzy_face: maybe their local Russian MrBeast promoted it?

I found a youtube short published earlier this year in Russian that got 1.7M views so I think that was the start of it.

Can any Russian speakers translate any major takeaways from this? Just curious to see if they say anything unique; obviously they are going over the specs and everything we already know about it.

They don’t have much accessibility to game consoles in that country, and the retro wave is probably just now starting over there, so the ability to build something like this for yourself is probably attractive.

Just set captions to auto translate :wink:

Here’s one more. How to wire up a DIY Arduboy in less than 60 seconds :smiley:

Lol with this speed run I didn’t notice it was an UNO instead of a Leonardo. :sweat_smile:

Yes, notice the increase in people trying to build with UNO and SH1106?? Coincidence?

Let’s be clear - got no problem with Russian people using the website. Just not robots, from anywhere.

I went digging around the channel and discovered that there’s a link to a website advertising a graphical/block programming tool for Arduino.

Translated by Google Translate from the Russian on the YouTube account:

ArduBlock is an innovative block programming language specifically designed to quickly and easily create programs on Arduino platforms. Each block has a unique feature - it contains a piece of code in C++, but is signed in your native language. This allows developers from all over the world, even those without deep programming knowledge, to easily and hassle-free create programs for their DIY projects.

They’re also selling various Arduino project kits.
(No Arduboy kits though, as far as I could see.)


One thing I discovered/realised as a side effect of looking up some software used in the video: technically it’s possible to release closed-source software under the MIT licence because the licence makes no mention of source code, just ‘software’. Weird, but true.

I want an Arduboyblock :face_holding_back_tears:

Vitaly has Arduino Leonardo support in his Ardublock 3. I can ask him to add specific support for some particular Arduboy peripherals if needed, as he already did for ESPboy.
The problem is that Ardublock is not focused on gaming, but more on working with well-known Arduino extension modules, robotics and IoT. I’ve asked him a couple of times to add support for working with sprites and collision detection like MicoJS (which, by the way, also has a block programming mode) or Scratch does, but he is still experimenting with the simple game code I made for him and still focused on modules.

The video can be watched with subtitles and English translation included. The translation is now quite accurate.
