Some thoughts on how useful Anubis really is. Given comments I've read elsewhere about scrapers starting to solve its challenges, I'm afraid Anubis will be outdated soon and we'll need something else.
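For reference, the challenge is essentially a SHA-256 proof-of-work: find a nonce so that the hash of the server's challenge string plus the nonce meets a difficulty target. A minimal sketch of a solver, assuming a simple "leading zero hex digits" target (the exact challenge format and parameters here are assumptions for illustration, not Anubis's real wire format):

```python
# Hypothetical sketch of solving an Anubis-style proof-of-work challenge:
# brute-force a nonce until SHA-256(challenge + nonce) starts with
# `difficulty` zero hex digits. Challenge string and difficulty are made up.
import hashlib
import time


def solve_pow(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce whose hash meets the difficulty target."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1


if __name__ == "__main__":
    start = time.monotonic()
    nonce = solve_pow("example-challenge-from-server", difficulty=4)
    elapsed = time.monotonic() - start
    # At this kind of difficulty the loop finishes in a fraction of a second
    # on commodity hardware, which is why the cost is negligible for a
    # well-funded scraping operation.
    print(f"nonce={nonce} found in {elapsed:.2f}s")
```

Once a scraping operation bothers to run something like that outside a browser, the gate costs them almost nothing.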
Even if you wanted to give them everything, they don't care. They just burn through their resources, recursively scraping every single link on your site. Providing standardized database dumps does nothing to stop your server from being overloaded by scrapers from various companies with deep pockets.
Like Anubis, that's not going to last. The point isn't to hammer web servers off the net; it's to get the precious data. The more standardized and streamlined that access becomes, and provided there's no preferential treatment for certain players (OpenAI / Google / Facebook), the dumb scrapers will burn themselves out.
One nice thing about Anubis and Nepenthes is that they will burn out those dumb scrapers faster and force them to become more sophisticated and stealthy. That should resolve the DDoS problem on its own.
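For anyone unfamiliar: Nepenthes is essentially a link maze. A rough sketch of the idea, with made-up paths and port, and omitting the slow responses and generated babble the real tool adds:

```python
# Hypothetical sketch of a Nepenthes-style tarpit: every request returns a
# page of procedurally generated links back into the maze, so a naive
# recursive crawler never runs out of URLs to follow.
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer


class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive "child" links deterministically from the requested path, so
        # the maze looks stable to the crawler but is effectively infinite.
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        links = "".join(
            f'<li><a href="/maze/{seed[i:i+8]}">{seed[i:i+8]}</a></li>'
            for i in range(0, 40, 8)
        )
        body = f"<html><body><ul>{links}</ul></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

A dumb recursive scraper that follows every link will sit in there burning its own budget instead of yours.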
For truly public data sources, I think coordinated database dumps are the way to go; for hostile carriers like Reddit and Facebook, it's going to be scraper arms-race warfare, like Cory Doctorow predicted.