I’m currently building a personal website and making a smol version to go along with it. The main website will likely live behind Anubis to avoid LLM scraping and the like.

Since Anubis forces the client to waste compute, it doesn’t align well with the smolweb guidelines as I understand them. How do people handle this?

  • algernon@lemmy.ml
    8 days ago

    That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (This is an exaggeration; it’s likely not going to be that bad.) How big are your smolweb pages and images?

    If I were hit by a few dozen scrapers, I wouldn’t care. But I host a few dozen small sites (all of which have opted out of search engine indexing, too), and even today, with the worst offenders firewalled off, I’m still getting 20-25 requests/second all day. Before I firewalled those off, I averaged ~300 requests/second sustained over months, with weekend waves going up to 1400 requests/second. It would’ve gone higher, but at that point my €4/month VPS could no longer handle the TLS handshakes: at 1400 req/sec, the handshakes alone exhausted what little CPU I had, and I wasn’t even serving anything. (At one point, before I implemented automatic firewalling, I scaled the server up and saw 20k req/sec. That is stupidly high, given that I host nothing particularly lucrative.)
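    For a sense of scale, those per-second rates can be converted into daily totals. This is only a back-of-the-envelope sketch: the helper name is made up for illustration, and the peak figure assumes the rate held for a full day, which the weekend waves may not have done.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def requests_per_day(rate_per_second: float) -> int:
    """Total requests in one day at a constant per-second rate (hypothetical helper)."""
    return round(rate_per_second * SECONDS_PER_DAY)


# Rates quoted in the comment above.
rates = [
    ("after firewalling, low end", 20),
    ("after firewalling, high end", 25),
    ("sustained average before firewalling", 300),
    ("weekend peak, if it held all day", 1400),
]

for label, rate in rates:
    print(f"{label}: {requests_per_day(rate):,} requests/day")
```

    At the sustained average of ~300 requests/second, that works out to roughly 26 million requests a day across a handful of small sites.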

    But smolweb? Honestly, I hate to break it to you, but nobody cares that much, not even LLMs.

    I’m sorry, they do.

    Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content.

    I don’t think it does. You know what can trivially get through Anubis? A real browser. You know what AI companies have in abundance? ~Infinite money to burn. If they want to get through Anubis, they will. Codeberg saw that happen. Proof of Work doesn’t scale well against the crawlers.