Language models like GPT-4 and Claude are powerful and useful, but the data on which they are trained is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a new, huge text dataset that's free to use and open to inspection.
Dolma, as the dataset is called, is intended to be the basis for the research group's planned open language model, or OLMo (Dolma is short for "Data to feed OLMo's Appetite"). Since the model is intended to be free for the AI research community to use and modify, so too (argue AI2's researchers) should be the dataset they use to create it.
This is the first "data artifact" AI2 is making available pertaining to OLMo, and in a blog post, the organization's Luca Soldaini explains the choice of sources and the rationale behind the various processes the team used to render it palatable for AI consumption. ("A more comprehensive paper is in the works," they note at the outset.)
Although companies like OpenAI and Meta publish some of the vital statistics of the datasets they use to build their language models, a lot of that information is treated as proprietary. Apart from the known consequence of discouraging scrutiny and improvement at large, there is speculation that perhaps this closed approach is due to the data not being ethically or legally obtained: for instance, that pirated copies of many authors' books were ingested.
You can see in this chart created by AI2 that the largest and most recent models provide only some of the information that a researcher would likely want to know about a given dataset. What information was removed, and why? What was considered high- versus low-quality text? Were personal details appropriately excised?
Of course it is these companies' prerogative, in the context of a fiercely competitive AI landscape, to guard the secrets of their models' training processes. But for researchers outside those companies, it makes the datasets and models more opaque and difficult to study or replicate.
AI2's Dolma is intended to be the opposite of these, with all its sources and processes — say, how and why it was trimmed to original English-language texts — publicly documented.
It's not the first attempt at an open dataset, but it is the largest by far (3 trillion tokens, an AI-native measure of content volume) and, they claim, the most straightforward in terms of use and permissions. It uses the "ImpACT license for medium-risk artifacts," the details of which you can see here. Essentially, it requires prospective users of Dolma to:
- Provide contact information and intended use cases
- Disclose any Dolma-derivative creations
- Distribute those derivatives under the same license
- Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation
For those worried that, despite AI2's best efforts, some personal data of theirs may have made it into the database, there's a removal request form available here. It's for specific cases, not just a general "don't use me" request.
If all that sounds good to you, access to Dolma is available via Hugging Face.