Anything you post on the internet should be considered public knowledge, and that includes being available to LLMs. You might be interested in running an LLM poisoning tarpit instead. It’s debatable whether that’s effective either, but if it’s the principle of the thing that matters to you, that might be a viable alternative that doesn’t do anything to bother legitimate users.
Short answer: You can’t. Longer answer: Accept it if/when it happens, but don’t make yourself an attractive target, and don’t put yourself in a position where it’s going to cost you money if it does. The hype around LLM scrapers is largely overblown for small personal static pages. LLM training wants fresh, data-heavy content. If they are scraping your smolweb site, either you’re updating it and hosting rich content far too frequently, or it’s an error on their part out of pure ignorance and laziness. That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (This is an exaggeration; it’s likely not going to be that bad.) How big are your smolweb page and images? A few dozen kilobytes? What’s your bandwidth limit, and what happens when you hit the cap? If you’re worried about hitting the cap too quickly, this can be straightforwardly managed with per-IP rate limiting, and throttling if necessary, to keep things under the cap and allow fair access to gentler users. But when you’re only hosting small files, most connections have plenty of bandwidth to handle scraping until the scrapers realize how pointless it is and give up, so it probably won’t be necessary.
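For anyone who does want the per-IP rate limiting mentioned above, here is a minimal sketch of one common way to do it, a token bucket per client address. The class name `PerIPRateLimiter` and the `rate` and `burst` parameters are made up for illustration, and in practice you would more likely rely on your web server's or reverse proxy's built-in limiter than roll your own.

```python
import time
from typing import Dict, Optional, Tuple


class PerIPRateLimiter:
    """Token bucket per client IP (hypothetical helper, not from any library).

    Each IP may make `rate` requests per second on average,
    with short bursts of up to `burst` requests.
    """

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        # Maps IP -> (tokens remaining, timestamp of last update).
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if this request should be served, False if throttled."""
        if now is None:
            now = time.monotonic()
        # A previously unseen IP starts with a full bucket.
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = PerIPRateLimiter(rate=1.0, burst=2)
    # Three requests at the same instant: the third exceeds the burst.
    print([limiter.allow("198.51.100.7", now=0.0) for _ in range(3)])
```

A gentle visitor never notices the limiter, while a scraper hammering from one address gets throttled without affecting anyone else. It does nothing against scrapers that spread requests across many addresses.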
I run about 20 small websites, all public and searchable, with no protections at all. Most of them are rarely updated and have been static for years. I just checked my traffic logs for the last day: ~14,000 hits. That may sound like a lot, but for a request that takes milliseconds to deliver, a computer sitting around not doing anything for the many seconds in between each of those requests is probably bored. Many different scrapers are obviously buried in that traffic, but they’re not the overwhelming horror that people make them out to be, at least in my experience.
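To put rough numbers on that, here is a quick back-of-the-envelope calculation. The 14,000 hits per day comes from the logs mentioned above; the 5 ms service time is purely an assumption standing in for "milliseconds", and it treats the hits as evenly spread, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60

hits_per_day = 14_000    # figure quoted from the traffic logs
service_time_s = 0.005   # ASSUMPTION: 5 ms to serve one small static request

# Average spacing between requests if they were evenly spread over the day.
avg_gap_s = SECONDS_PER_DAY / hits_per_day

# Fraction of the day the server spends actually serving requests.
busy_fraction = hits_per_day * service_time_s / SECONDS_PER_DAY

print(f"average gap between requests: {avg_gap_s:.1f} s")
print(f"fraction of the day spent serving: {busy_fraction:.4%}")
```

Under those assumptions the server is idle for roughly six seconds between requests and busy for well under a tenth of a percent of the day.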
Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content. This stuff is like manna from heaven for LLM bots. Same with code repositories like Forgejo. They are very attractive targets for scrapers, with lots of frequent updates that require frequent scraping and also lots of very large files to download and ingest. Scraping will absolutely hammer your bandwidth if the scrapers find you an attractive target and they are stupid (which they are).
But smolweb? Honestly, I hate to break it to you but nobody cares that much, not even LLMs.
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • GitHub - minio/minio: "This repository was archived by the owner on Apr 25, 2026. It is now read-only." • English
1·18 days ago
I don’t know where you’re seeing this, maybe this is a Fediverse thing, but I’m literally on Selfhosted@lemmy.world right now.
cecilkorik@lemmy.ca to
Fediverse@lemmy.world • The Seven Deadly Fediverse UX Sins: A Redemption Report Card - We Distribute • English
6·20 days ago
I believe you, and I feel for you. The saddest part about AI is how it has tainted all high-effort, carefully organized work to the point that it makes it hard to distinguish between the most trustworthy content and the least trustworthy. We need better tools for information provenance. Like I said, the first thing I did was look into your backgrounds to try to understand “is this some AI slop bot or a real person with a real brain” and everything I looked at suggested it being legit and that’s why I decided to give you the benefit of the doubt. But that doubt is everywhere nowadays. It’s rough out there.
cecilkorik@lemmy.ca to
Fediverse@lemmy.world • The Seven Deadly Fediverse UX Sins: A Redemption Report Card - We Distribute • English
49·20 days ago
The formatting, style, cadence and tone feel very AI to me. The authors seem like real people with real history and I’m willing to give them the benefit of the doubt, and the topic and status summarization is genuinely interesting whether it’s AI or not, but it’s hard not to feel a bit sus reading it.
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • GitHub - minio/minio: "This repository was archived by the owner on Apr 25, 2026. It is now read-only." • English
1·20 days ago
That’s great for production AWS managed services, but that still sounds like the opposite of self-hosting to me. I don’t need scaling like that. I’m not lying when I admit I’m using sshfs (which was a slightly tongue-in-cheek counterpoint to S3), and despite everyone dunking on it, it is in fact working perfectly at my scale. I know I’ve been downvoted to purgatory but I still stand by my original comment. I don’t understand why you would need S3 or S3 compatibility in a self-hosting context. The closest someone has come to explaining it is the guy who said choice is good… like, yeah, it’s good to have the choice I guess, but… still doesn’t seem like a great choice for self-hosting. I appreciate you trying to explain it but I feel like everyone is missing the self-hosting context here. For a little home lab I simply don’t see the value. Why are people promoting AWS and AWS-adjacent services here?
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • GitHub - minio/minio: "This repository was archived by the owner on Apr 25, 2026. It is now read-only." • English
8·24 days ago
So enlighten me then, save me from my terrible hack that is working fine for me and tell me what it DOES have to do with. I thought S3 was a remote filesystem you can use, essentially Amazon’s proprietary version of WebDAV where you get an HTTP bucket you can only access with AWS proprietary tools. What’s the attraction? Clearly it seems like people love it, and I am getting dunked on for asking an honest question, which feels a bit unhealthy and unpleasant for the self-hosting community.
Am I supposed to be familiar with AWS infrastructure as a prerequisite for being here?
cecilkorik@lemmy.ca to
Fediverse@lemmy.world • Can federating be modified so it's not dependent domain names? • English
1·24 days ago
Yes that’s exactly what he said, all the worst of humanity. (partial sarcasm. maybe.)
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • GitHub - minio/minio: "This repository was archived by the owner on Apr 25, 2026. It is now read-only." • English
146·24 days ago
S3 compatibility is nice I guess if you need S3 compatibility but also… why would you need that?
sshfs does everything I need and compatibility is almost native.
He’s still right that it’s weird people like you are going to bat to defend them. Microsoft sucks. It must get tiring if you have to call out every inaccurate thing everyone says to try to tear them down. The important take-home message is that we need to tear them down, they suck.
You don’t see people bothering to defend Epstein, for example. Even though there’s lots of inaccurate stuff going around, there’s enough accurate stuff to be absolutely confident he was an absolute loathsome piece of shit not worth defending. Not worth the effort to defend. Why bother?
What do you see in Microsoft that you think is worth defending? Github is shit, and it’s evil. Let it go.
Private trackers are like the Matrix’s “zion”. When civilization collapses into a dystopian surveillance capitalism hellscape and the AIs and fascist governments take over the net, the last free humans will be hiding in private tracker communities, sharing freely and building a resistance. Will we have mechs with gatling guns? I don’t know, all I can say is I hope so because it looks like we’re going to need it.
Like nuclear fusion, IPv6 is one of those things that feels like it’s constantly about 20 years away, no matter how long we work on it.
Homelab is a form of self-hosting and vice versa as far as I’m concerned. Ask away, I’ve never seen that rule being strictly enforced and I don’t think the Lemmy community is honestly large enough to support such a rule. It was probably migrated over from Reddit where there were viable communities for all those things.
cecilkorik@lemmy.ca to
Free and Open Source Software@beehaw.org • All my software including games now have a license grant for people to use the polyform noncommercial license or polyform strict license instead of the GNU GPL v3.0 • English
2·1 month ago
From Wikipedia: “The Time Cube website did not have a navigation structure such as a menu or a central home page; instead, it was one long continuous page.”
cecilkorik@lemmy.ca to
Free and Open Source Software@beehaw.org • All my software including games now have a license grant for people to use the polyform noncommercial license or polyform strict license instead of the GNU GPL v3.0 • English
3·1 month ago
What in the name of Timecube is that webpage.
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • Checking....what's the status for FOSS agentic AI models with skills? • English
2·1 month ago
In that case I’d definitely recommend taking a look at pi. It’s a fairly minimal and controllable starting point where you’re in the driver’s seat at all times and most “features” are opt-in and handled responsibly. And since it’s extensible, you can even use plugins like the ones here to do things like add more protections against undesired actions if you want. If that is too minimal and you eventually realize you want something a little bit more like OpenClaw, you might want to look into Hermes-Agent, which has similar comprehensiveness to OpenClaw but seems to be a lot more responsibly designed. I don’t have any personal experience with it but that seems to be what most of the “security-thoughtful AI keeners” (which I feel is a bit of a contradiction but people seem to be having some success with it) are using these days.
cecilkorik@lemmy.ca to
Selfhosted@lemmy.world • Checking....what's the status for FOSS agentic AI models with skills? • English
11·1 month ago
Absolutely. There are tons of open-licenced, open-weight (the equivalent of open-source for AI models) models capable of what is called “tool usage”. The key thing to understand is that they’re never quite perfect, and they don’t all “use tools” quite as effectively or in the same way as each other. This is common to LLMs, and it is critical to understand that at the end of the day they are just text generators; they do not “use tools” themselves. They create specific structured text that triggers some other software, typically called a harness but could also be called a client or frontend, to call those tools on your system. OpenClaw is an example of such a harness (and not a great or particularly safe one in my opinion, but if you want to be a lunatic and give an AI model free rein it seems to be the best choice). You can use commercial harnesses too by configuring or tricking them into connecting to a local model instead of their commercial one, although I don’t recommend this for a variety of reasons. If you really want to use Claude Code itself, people have done it, but I don’t find it works very well since all its prompts and tool calling are optimized for Claude models. Besides OpenClaw, other popular harnesses for local models include OpenCode (as close as you’re going to get to Claude for local models) or Cursor, and even Ollama has their own CLI harness now. Personally I use OpenCode a lot but I am starting to lean towards pi-mono (it’s just called pi but that’s ungoogleable). It is very minimal and modular, making it intentionally easy to customize with plugins and skills you can automatically install to make it exactly as safe or capable or visual as you wish it to be.
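To make the "structured text that triggers other software" idea concrete, here is a toy sketch of what a harness does with one model reply. Everything in it is invented for illustration: the JSON shape with `tool` and `arguments` keys, the `TOOLS` registry, and the `word_count` tool are not the format of any real model or harness, which each define their own.

```python
import json


def word_count(text: str) -> int:
    """A harmless example tool: count whitespace-separated words."""
    return len(text.split())


# Hypothetical registry of tools the harness is willing to run.
TOOLS = {"word_count": word_count}


def run_harness_step(model_output: str) -> str:
    """Handle one model reply: run a tool if it asks for one, else pass text through."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        # Not structured output, so treat it as an ordinary text answer.
        return model_output
    if not isinstance(call, dict) or "tool" not in call:
        return model_output
    tool = TOOLS.get(call["tool"])
    if tool is None:
        # The harness, not the model, decides what is allowed to run.
        return f"error: unknown tool {call['tool']!r}"
    result = tool(**call.get("arguments", {}))
    # A real harness would feed this result back to the model as new context.
    return f"tool result: {result}"


if __name__ == "__main__":
    reply = '{"tool": "word_count", "arguments": {"text": "just a text generator"}}'
    print(run_harness_step(reply))
    print(run_harness_step("Plain answers pass straight through."))
```

The model only ever produces the string; the allow-list in the harness is where safety decisions actually get made, which is why the choice of harness matters so much.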
As a minor diversion we should also discuss what a “tool” is. In this context there are some common basic tools that some or most tool-use models will have or understand some variation of, out of the box. Things like editing files, running command-line tools, opening documents, and searching the web are common built-in skills that pretty much any model advertising itself as capable of “tool use” or “tool calling” will support, although some agents will be able to use these skills more capably and effectively than others. Just like some people know the Linux command line fluently and can completely operate their system with it, while others only know basic commands like `ls` or `cat` and need a GUI or guidance for anything more complex, AI models are similar: some (and the latest models in particular) are incredibly capable with even just their basic built-in tools. However they’re not limited by what’s built in, as like I said, they can accept guidance on what to use and how to use it. You can guide them explicitly if you happen to be fluent in their tools, but there are kind of two competing models for how to give them that guidance automatically. One is MCP (Model Context Protocol), which is a separate server they can access that provides structured listings of different kinds of tools they can learn to use and how they work, basically allowing them to connect to a huge variety of APIs in almost any software or service. Some harnesses have an MCP built-in. The other approach is called “skills” and seems to be (to me) a more sensible and flexible approach to giving the AI model enough understanding to become more capable and expand the tools it can use. Again, providing skills is usually something handled by the harness you’re using.
To make this a little less abstract you can put it in perspective of Claude: Anthropic provides several different Claude models like Haiku, Sonnet, and Opus. These are the text-generation models and they have been trained to produce a particular tool usage format, but Opus tends to have more built-in capability than something like Haiku for example. Regardless of which model you choose though (and you can switch at any time) you’ll be using a harness, typically “claude code”, which is the CLI tool most people use to interact with Claude in an agentic, tool-calling capacity.
On the open and local side of the landscape, we don’t have anything quite as fast or capable as Claude code unfortunately, but we can do surprisingly okay considering we’re running small local models on consumer hardware, not massive data center farms being enticingly given away or rented for pennies on the dollar of what they’re actually costing these companies on the hopes of successful marketshare-capture and vendor-lock-in leading to future profits.
Here are some pretty capable tool-use models I would recommend (most should be available for download through ollama and other sources like huggingface)
- gemma4 (the latest and greatest hotness, MIT licensed using TurboQuant to deliver pretty incredible capability, performance and results even with limited VRAM)
- qwen3.5 (from Alibaba, a consistent and traditional leader in open models so far with good capability and modest performance)
- qwen3-coder-next (a pretty huge coding-focused model you might struggle to run unless you have a very beefy system and GPU)
- glm4.7-flash (a modestly capable and reasonably fast option)
- devstral-small-2 (an older, not-so-small variant of mistral, the French open-weight AI model if you’re looking for a non-Chinese, non-US based model which are few and far between)
Great news! Also thanks for providing the follow-up, hopefully it helps people who use Unraid in the future.
You can do all those things with proper routing and there is no difference from mobile devices (as long as they use DHCP and what mobile device wouldn’t?). What I’m suggesting does not change anything on the public side. You still authenticate publicly to renew your certificates. You still give the same certificates to both public and local networks. They’re still valid. Nothing changes.
The only difference is that when you’re local, your DNS gives you the correct local IP address where that service is hosted, say, 192.168.12.34, instead of using public DNS, getting an external IP that’s on the wrong side of the router, and having to go outside your own network and come back in. Hairpin is like that Simpsons episode where Abe goes in the revolving door, takes off his hat, puts his hat back on, and goes back out the same revolving door in the span of 2 seconds. It’s pointless, why are you doing that? If you didn’t want to be on the outside of the network, why are you going to the outside of the network first? Just stay inside the network. Get the right IP. No hairpin routing needed. No certificate madness needed. Everything just works the way it’s supposed to (because this is in fact the way it’s supposed to work).
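The idea above is usually called split-horizon DNS, and the decision it makes can be sketched in a few lines. This is only a model of the logic, not a DNS server: the hostname `service.example.com`, the `192.168.12.0/24` LAN range, and the external address `203.0.113.10` are placeholders (only 192.168.12.34 comes from the comment), and in practice you would configure this as a local override in your router or local resolver.

```python
import ipaddress

# ASSUMPTION: the home LAN range; adjust to match your own network.
LAN = ipaddress.ip_network("192.168.12.0/24")

# One hostname, two answers depending on where the question comes from.
RECORDS = {
    "service.example.com": {
        "internal": "192.168.12.34",  # local address of the service
        "external": "203.0.113.10",   # placeholder public address
    }
}


def resolve(name: str, client_ip: str) -> str:
    """Return the address a client should use, based on which side of the router it is on."""
    on_lan = ipaddress.ip_address(client_ip) in LAN
    view = "internal" if on_lan else "external"
    return RECORDS[name][view]


if __name__ == "__main__":
    # A laptop on the LAN gets the local address, so no hairpin is needed.
    print(resolve("service.example.com", "192.168.12.50"))
    # A phone on mobile data gets the public address as usual.
    print(resolve("service.example.com", "198.51.100.7"))
```

The hostname is the same in both cases, so the same certificate stays valid on both sides of the router.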

Way too much. The Nvidia P40 I scavenged for my homemade AI server runs at 120W (throttled down from 250W default) on its own. Then I’ve got two more PCs running purely as redundant firewalls with automatic failover, pretty unnecessary but if that’s not the sort of thing homelabbing is for then I’m going to keep doing it wrong because I find it fun. Then there’s the Minecraft server, which is pretty beefy and also eternally running at max CPU because my niece is a monster who loves spamming spawn eggs and should never be allowed access to creative mode. And I don’t even have the two rack units of disk arrays I bought at auction powered up yet because they need 240V, which I don’t have handy. I guess someone could do the math on what 48 enterprise SAS drives will pull if they need to satisfy their curiosity; I’m not sure I want to. I will hook them up someday but for now ignorance is bliss. All I know is it’s a lot, and there’s stuff I’m not even including in this.