Arun Shah™@lemmy.world to Technology@lemmy.ml · 1 month ago
Top 10 websites in the world (July 2025) · lemmy.ml · 31 comments
Eager Eagle@lemmy.world · 1 month ago

There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.

What bothers me is that they apparently crawl instead of just… using the API, like:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
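For context, a minimal sketch of what that API call looks like in practice. The endpoint and parameters come straight from the URL above; the Python requests usage and the field access are illustrative assumptions, not anything from the thread:

```python
import requests

# MediaWiki Action API endpoint for English Wikipedia
API_URL = "https://en.wikipedia.org/w/api.php"

# Same parameters as the example URL in the comment above.
params = {
    "action": "parse",                # return the parsed page
    "format": "json",                 # JSON response
    "page": "Lemmy_(social_network)",
    "formatversion": 2,               # flatter, cleaner JSON layout
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# With formatversion=2, the rendered HTML is a plain string under
# parse.text rather than a nested {"*": ...} object.
print(data["parse"]["title"])
print(data["parse"]["text"][:200])
```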
clb92@feddit.dk · 1 month ago

> valid reasons for not wanting the whole database e.g. storage constraints

If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.
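As a rough sanity check on that claim, a quick sketch that asks the server for the file size without downloading anything. The URL is the standard Wikimedia dumps location for the current-revisions English Wikipedia dump, assumed here rather than taken from the thread:

```python
import requests

# Latest English Wikipedia dump, current revisions only, bzip2-compressed.
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

# HEAD request: fetch headers only, no body.
resp = requests.head(DUMP_URL, timeout=30)
resp.raise_for_status()

size_gb = int(resp.headers["Content-Length"]) / 1e9
# Expect a figure in the tens of GB, nowhere near petabytes.
print(f"Compressed pages-articles dump: ~{size_gb:.0f} GB")
```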