"How Wikipedia uses bots and how bots use Wikipedia are extremely different, however. For years it has been clear that fledgling A.I. systems were being trained on the site’s articles, as part of the process whereby engineers “scrape” the web to create enormous data sets for that purpose. In the early days of these models, about a decade ago, Wikipedia represented a large percentage of the scraped data used to train machines. The encyclopedia was crucial not only because it’s free and accessible, but also because it contains a mother lode of facts and so much of its material is consistently formatted. In more recent years, as so-called Large Language Models, or L.L.M.s, increased in size and functionality — these are the models that power chatbots like ChatGPT and Google’s Bard — they began to take in far larger amounts of information. In some cases, their meals added up to well over a trillion words. The sources included not just Wikipedia but also Google’s patent database, government documents, Reddit’s Q. and A. corpus, books from online libraries and vast numbers of news articles on the web. But while Wikipedia’s contribution in terms of overall volume is shrinking — and even as tech companies have stopped disclosing what data sets go into their A.I. models — it remains one of the largest single sources for L.L.M.s. Jesse Dodge, a computer scientist at the Allen Institute for AI in Seattle, told me that Wikipedia might now make up between 3 and 5 percent of the scraped data an L.L.M. uses for its training. “Wikipedia going forward will forever be super valuable,” Dodge points out, “because it’s one of the largest well-curated data sets out there.” There is generally a link, he adds, between the quality of data a model trains on and the accuracy and coherence of its responses."