Arun Shah™@lemmy.world to Technology@lemmy.ml · 1 month ago
Top 10 websites in the world (July 2025) · lemmy.ml · 31 comments
Eager Eagle@lemmy.world · 1 month ago

There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.

What bothers me is that they apparently crawl instead of just… using the API, like:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
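For context, a minimal sketch of what that API call looks like in practice. The endpoint and parameters come straight from the URL above; the Python requests usage and the field access are illustrative assumptions, not anything from the thread:

```python
import requests

# MediaWiki Action API endpoint for English Wikipedia
API_URL = "https://en.wikipedia.org/w/api.php"

# Same parameters as the example URL in the comment above.
params = {
    "action": "parse",                # return the parsed page
    "format": "json",                 # JSON response
    "page": "Lemmy_(social_network)",
    "formatversion": 2,               # flatter, cleaner JSON layout
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# With formatversion=2, the rendered HTML is a plain string under
# parse.text rather than a nested {"*": ...} object.
print(data["parse"]["title"])
print(data["parse"]["text"][:200])
```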
clb92@feddit.dk · 1 month ago

> valid reasons for not wanting the whole database e.g. storage constraints

If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.
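As a rough sanity check on that claim, a quick sketch that asks the server for the file size without downloading anything. The URL is the standard Wikimedia dumps location for the current-revisions English Wikipedia dump, assumed here rather than taken from the thread:

```python
import requests

# Latest English Wikipedia dump, current revisions only, bzip2-compressed.
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

# HEAD request: fetch headers only, no body.
resp = requests.head(DUMP_URL, timeout=30)
resp.raise_for_status()

size_gb = int(resp.headers["Content-Length"]) / 1e9
# Expect a figure in the tens of GB, nowhere near petabytes.
print(f"Compressed pages-articles dump: ~{size_gb:.0f} GB")
```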