Smolweb and scraper protection

solbear@slrpnk.net · 13 days ago

algernon@lemmy.ml · 7 days ago

I need to join more communities, because I’m noticing these anti-scraper questions way too late.

I’d like to direct your attention to iocaine. It’s somewhat similar to Anubis in the sense that it sits between your reverse proxy and the real content, but unlike Anubis, it does not use proof of work. It exploits the fact that most of the scrapers are incredibly dumb, and can be trivially detected:

Is it in ai.robots.txt’s list? It’s a crawler.
Does it have Firefox/ or Chrome/ in the user agent, but sent no sec-fetch-mode header? Pretty much guaranteed to be a crawler, with few exceptions (e.g. Googlebot, Bingbot - but I’d classify those as hostile crawlers too).
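The two checks above can be sketched in a few lines of Python. This is only an illustration of the heuristics, not iocaine's actual code: the function name looks_like_crawler and the three-entry KNOWN_AI_AGENTS tuple are assumptions made for the example, and the real ai.robots.txt list is much longer.

```python
# Hypothetical sample only; the full list is maintained by the ai.robots.txt project.
KNOWN_AI_AGENTS = ("GPTBot", "ClaudeBot", "CCBot")


def looks_like_crawler(headers: dict[str, str]) -> bool:
    """Return True if the request headers match either heuristic."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")

    # Heuristic 1: the user agent names a known AI crawler.
    if any(agent.lower() in user_agent.lower() for agent in KNOWN_AI_AGENTS):
        return True

    # Heuristic 2: claims to be Firefox or Chrome, but sent no Sec-Fetch-Mode header.
    claims_browser = "Firefox/" in user_agent or "Chrome/" in user_agent
    return claims_browser and "sec-fetch-mode" not in lowered
```

A request with a Chrome/ user agent and no Sec-Fetch-Mode header is flagged, while the same user agent with Sec-Fetch-Mode: navigate is let through.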

Serve garbage or a static page with poisoned URLs to these, and you get rid of 90%+ of the bots. Why the poisoned URLs? Because when they come back riding headless chromes, they usually crawl URLs the dumb bots collected. If you poison those URLs in a way that lets you identify them trivially, you can block the headless chromes too, which you wouldn’t be able to detect otherwise. Whether they come through residential proxies or not, as long as their queue is collected by the dumb bots, you can catch them.

On top of this, to reduce the load on your servers, iocaine can also block requests. It can be configured to serve garbage & poisoned URLs to the dumb bots, and then firewall anything that hits a poisoned URL.
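One way to make poisoned URLs trivially identifiable is to sign them with a secret, so the server can recognise them later without storing anything. The sketch below shows that general idea; it is an assumption about how this could be done, not a description of how iocaine does it. SECRET_KEY, POISON_PREFIX and both function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"replace-with-a-private-random-key"  # hypothetical; keep it private
POISON_PREFIX = "/archive/"  # hypothetical path prefix for generated links


def _tag(nonce: str) -> str:
    """Short keyed signature over the nonce."""
    return hmac.new(SECRET_KEY, nonce.encode(), hashlib.sha256).hexdigest()[:16]


def make_poisoned_url() -> str:
    """Generate a fresh URL that only this server can recognise as poisoned."""
    nonce = secrets.token_hex(8)
    return f"{POISON_PREFIX}{nonce}-{_tag(nonce)}"


def is_poisoned_url(path: str) -> bool:
    """Return True if the path carries a valid signature, i.e. we generated it."""
    if not path.startswith(POISON_PREFIX):
        return False
    nonce, sep, tag = path[len(POISON_PREFIX):].partition("-")
    if not sep:
        return False
    # Constant-time comparison of the supplied tag with the expected one.
    return hmac.compare_digest(tag.encode(), _tag(nonce).encode())
```

Any client that later requests a path for which is_poisoned_url returns True can only have learned it from the garbage page, so its address is a candidate for the firewall.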

The false positive rate is surprisingly low.

cecilkorik@lemmy.ca · 13 days ago

Short answer: You can’t.

Longer answer: Accept it if/when it happens, but don’t make yourself an attractive target, and don’t put yourself in a position where it’s going to cost you money if it does. The hype around LLM scrapers is largely overblown for small personal static pages. LLM training wants fresh, data-heavy content. If they are scraping your smolweb site, either you’re filling it with rich content and updating it far too frequently, or it’s an error on their part out of pure ignorance and laziness.

That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (this is an exaggeration it’s likely not going to be that bad) How big is your smolweb page and images? A few dozen kilobytes? What’s your bandwidth limit, and what happens when you hit the cap?

If you’re worried about hitting the cap too quickly, this can be managed with per-IP rate limiting, plus throttling if necessary, to keep things under the cap and allow fair access to gentler users. But when you’re only hosting small files, most connections have plenty of bandwidth to handle the scraping until the scrapers realize how pointless it is and give up, so it probably won’t be necessary.
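For anyone unsure what per-IP rate limiting means in practice, here is a minimal token-bucket sketch. In a real deployment this would normally live in the reverse proxy rather than in application code; the class name, the rate and burst parameters, and the injectable clock are all assumptions made for the example.

```python
import time


class PerIpRateLimiter:
    """Token bucket per client IP: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, ip: str) -> bool:
        """Return True if this request fits the IP's budget, False to throttle it."""
        now = self.clock()
        tokens, last_seen = self._buckets.get(ip, (self.burst, now))
        # Refill in proportion to the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last_seen) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[ip] = (tokens, now)
        return allowed
```

With rate=1.0 and burst=2.0, an address can make two requests back to back, is refused on the third, and is allowed again one second later. Other addresses are unaffected. A long-running version would also need to evict idle entries from the bucket table.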

I run about 20 small websites, all public and searchable, with no protections at all. Most of them are rarely updated and have been static for years. I just checked my traffic logs for the last day: ~14,000 hits. That may sound like a lot, but when each request takes milliseconds to deliver, a computer that sits idle for the many seconds between requests is probably bored. Many different scrapers are obviously buried in that traffic, but they’re not the overwhelming horror that people make them out to be, at least in my experience.
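To put the ~14,000 hits in perspective, the arithmetic is quick to check. The figures are the ones quoted above; spreading the hits evenly over the day and over 20 sites is a simplifying assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

hits_per_day = 14_000  # figure from the traffic logs mentioned above
sites = 20             # "about 20 small websites"

requests_per_second = hits_per_day / SECONDS_PER_DAY
seconds_between_requests = SECONDS_PER_DAY / hits_per_day
hits_per_site_per_day = hits_per_day / sites

print(f"{requests_per_second:.2f} requests/second")               # 0.16 requests/second
print(f"{seconds_between_requests:.1f} seconds between requests")  # 6.2 seconds between requests
print(f"{hits_per_site_per_day:.0f} hits per site per day")        # 700 hits per site per day
```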

Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content. This stuff is like manna from heaven for LLM bots. Same with code repositories like forgejo. They are very attractive targets for scrapers, with lots of frequent updates that require frequent scraping and also lots of very large files for it to download and ingest. This will absolutely hammer your bandwidth if the scrapers find you an attractive target and they are stupid (which they are).

But smolweb? Honestly, I hate to break it to you but nobody cares that much, not even LLMs.

algernon@lemmy.ml · 7 days ago

“That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (this is an exaggeration it’s likely not going to be that bad) How big is your smolweb page and images?”

If I were hit by a few dozen scrapers, I wouldn’t care. But I host a few dozen small sites (which all opted out of search engine indexing too), and even today, with the worst offenders firewalled off, I’m still getting 20-25 requests/second throughout the day. Prior to firewalling those off, I had an average of ~300 requests/second sustained over months, with weekend waves going up to 1400 requests/second. It would’ve gone higher, but at that point, my €4/month VPS was unable to handle the TLS handshakes. At 1400 requests/second, just doing the handshakes exhausted what little CPU I had, and I didn’t even serve anything. (At one point, before I implemented automatic firewalling, I scaled the server up and saw 20k requests/second - stupidly high, given that I host nothing particularly lucrative.)
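Converted to daily totals, those rates look like this. The conversion assumes each rate is held constant for a full 24 hours, which is a simplification for the peak figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_day(requests_per_second: float) -> int:
    """Daily request count if the given rate were sustained for 24 hours."""
    return round(requests_per_second * SECONDS_PER_DAY)


rates = {
    "after firewalling (low)": 20,
    "after firewalling (high)": 25,
    "sustained average before": 300,
    "weekend waves": 1400,
    "scaled-up server": 20_000,
}

for label, rate in rates.items():
    print(f"{label}: {per_day(rate):,} requests/day")
```

Even the post-firewall figure of 20 requests/second works out to 1,728,000 requests a day, and the ~300 requests/second average to 25,920,000.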

“But smolweb? Honestly, I hate to break it to you but nobody cares that much, not even LLMs.”

I’m sorry, they do.

“Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content.”

I don’t think it does. You know what can trivially get through Anubis? A real browser. You know what AI companies have in abundance? ~Infinite money to burn. If they want to get through Anubis, they will. Codeberg saw that happen. Proof of Work doesn’t scale well against the crawlers.

solbear@slrpnk.net · 13 days ago

I am not that concerned with the traffic, but I want to dissuade them from adding my texts to their corpus if possible. More out of principles than out of any illusion that it has any real, practical consequences.

Currently my plan is to serve a restrictive robots.txt (which I assume is completely ignored), pass traffic through Anubis with a policy that allows regular browsers without challenge but denies known scrapers (I don’t really think they send truthful user agent strings so this probably won’t do much either), and configure Nginx to be more aggressive with rate limiting. I also plan to license texts under a non-commercial CC license (which I also don’t think will really prevent them).
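As a sanity check on the robots.txt part of that plan, Python's standard urllib.robotparser can show what a well-behaved crawler would conclude from the file. The ROBOTS_TXT content below is a hypothetical example, not the actual file from this site, and the is_allowed helper is made up for the sketch. It only models crawlers that honour robots.txt; ones that ignore it are unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny two known AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a compliant crawler with this user agent would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("GPTBot", "https://example.org/post"))                     # False
print(is_allowed("Mozilla/5.0 Firefox/128.0", "https://example.org/post"))  # True
```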

I wonder if anyone has experience with such a setup and can report how much or how little it actually does?

The full, more interactive and JS-enabled site will issue challenges though.

cecilkorik@lemmy.ca · 13 days ago

Anything you post on the internet should be considered public knowledge, and that includes being available to LLMs. You might be interested in running an LLM poisoning tarpit instead. It’s debatable whether that’s effective either, but if it’s the principle of the thing that matters to you, that might be a viable alternative that doesn’t do anything to bother legitimate users.

x1gma@lemmy.world · 13 days ago

You’ve already answered your own question. All of those measures stop, at most, the public and “nice” scrapers. Anything more is pretty close to impossible: if you post something online on a public page, it’s public. Public stuff will be scraped for good, evil, LLM and non-LLM usage.

You cannot prevent your public data from being scraped into LLM data. It’s simply not possible. The moment your page gets picked up somewhere, either by scanners, DNS, domain, TLS registry scanners, whatever - it will get scraped. There will be a point where your defenses fail and respond to a bot posing as a regular user, and your page will get fed to the money printing machine.

The goal of Anubis and similar tools is to make LLM scraping more expensive, and at least prevent LLM scrapers from freeloading on content completely. Blocking scrapers is either purely trust based (user agents and similar identification, which can be faked easily) or heuristic/behaviour based, which can never achieve 100% correct detection (i.e. you will always have some malicious requests going through, and some legitimate users getting blocked).

The more restrictions you try to apply, the more your regular users will be impacted, and that’s the trade-off you need to make.