Smolweb and scraper protection

solbear@slrpnk.net · 14 days ago

solbear@slrpnk.net · 13 days ago

I am not that concerned with the traffic, but I want to dissuade scrapers from adding my texts to their corpus if possible. This is more out of principle than out of any illusion that it will have real, practical consequences.

Currently my plan is to serve a restrictive robots.txt (which I assume is completely ignored), pass traffic through Anubis with a policy that allows regular browsers without challenge but denies known scrapers, and configure Nginx to be more aggressive with rate limiting. I don’t really think scrapers send truthful user agent strings, so the Anubis policy probably won’t do much either. I also plan to license texts under a non-commercial CC license (which I also don’t think will really stop them).
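For what it’s worth, here is a minimal sketch of what the robots.txt part of that plan does and doesn’t cover, using Python’s standard `urllib.robotparser`. The robots.txt contents, the agent names (`GPTBot`, `CCBot`), the `example.org` URL, and the `is_allowed` helper are all assumptions for illustration, not anything taken from the actual site. It only models what a crawler that chooses to honour robots.txt would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical restrictive robots.txt: a couple of named crawlers are denied
# everything, all other agents are allowed. Agent names are examples only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str,
               url: str = "https://example.org/posts/hello.html") -> bool:
    """Return what a *compliant* crawler would decide for this user agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A scraper that announces itself is told to stay away; one that simply
    # claims to be a browser falls under the wildcard rule.
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, "->", "allowed" if is_allowed(agent) else "disallowed")
```

The check runs entirely on the crawler’s side, so it only ever affects scrapers that both identify themselves and choose to comply.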

I wonder if anyone has experience with such a setup and can report how much, or how little, it actually does?

The full, more interactive and JS-enabled site will issue challenges though.

cecilkorik@lemmy.ca · 13 days ago

Anything you post on the internet should be considered public knowledge, and that includes being available to LLMs. You might be interested in running an LLM poisoning tarpit instead. It’s debatable whether that’s effective either, but if it’s the principle of the thing that matters to you, it might be a viable alternative that doesn’t bother legitimate users.

x1gma@lemmy.world · 13 days ago

You’ve already answered your own question. All of those measures stop, at most, the public and “nice” scrapers. Anything more is pretty close to impossible: if you post something on a public page, it’s public. Public stuff will be scraped for good, evil, LLM and non-LLM usage.

You cannot prevent your public data from being scraped into LLM data. It’s simply not possible. The moment your page gets picked up somewhere, whether by scanners, DNS, domain, TLS registry scanners, whatever, it will get scraped. There will be a point where your defenses fail and respond to a bot posing as a regular user, and your page will get fed to the money printing machine.

The goal of Anubis and similar tools is to make LLM scraping more expensive, and at least prevent LLM scrapers from freeloading on content completely. Blocking scrapers is either purely trust based (user agents and similar identification, which can be faked easily) or heuristic/behaviour based, which can never achieve 100% correct detection (i.e. you will always have some malicious requests going through, and some legitimate users getting blocked).
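To make the trust-based versus heuristic distinction concrete, here is a rough sketch of a filter that combines a user agent denylist with a sliding-window rate limit per client IP. This is not how Anubis or Nginx work internally; the `ScraperFilter` class, the denylist tokens, and the thresholds are all made up for illustration.

```python
from collections import defaultdict, deque

# All values below are assumptions chosen for the example.
DENYLISTED_AGENT_TOKENS = ("gptbot", "ccbot")  # trust-based: self-reported
MAX_REQUESTS = 5                               # heuristic: requests allowed...
WINDOW_SECONDS = 10.0                          # ...per client in this window


class ScraperFilter:
    """Toy request filter: user agent denylist plus sliding-window rate limit."""

    def __init__(self) -> None:
        # Timestamps of recently accepted requests, keyed by client IP.
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, user_agent: str, now: float) -> bool:
        # Trust-based check: only catches clients that identify honestly.
        agent = user_agent.lower()
        if any(token in agent for token in DENYLISTED_AGENT_TOKENS):
            return False

        # Behaviour-based check: drop timestamps that left the window,
        # then refuse the request if the client is already at the limit.
        hits = self._hits[client_ip]
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= MAX_REQUESTS:
            return False
        hits.append(now)
        return True
```

With these numbers, a scraper that sends a browser-like user agent and waits three seconds between requests is never blocked, while a human who opens seven pages in under a second gets refused on the sixth. Tightening the thresholds shifts errors from one group to the other rather than removing them.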

The more restrictions you try to apply, the more your regular users will be impacted, and that’s the trade-off you need to make.