OpenAI Training Data to Be Inspected in Authors’ Copyright Cases

As a seasoned gamer and tech enthusiast with a deep understanding of AI and its potential implications, I find myself intrigued by this latest development between OpenAI and the authors suing them. The prospect of gaining access to the training data used by these advanced AI models is a significant step towards transparency and accountability in the industry.


For the very first time, OpenAI is granting permission for an external examination of its training data. The purpose of this review is to determine if any copyrighted materials were utilized in developing their technology.

On Tuesday, the authors suing OpenAI, the company led by Sam Altman, announced that they and OpenAI had agreed on protocols for inspecting the information. They plan to seek details about the inclusion of their works in training data sets, which could become a key issue in a lawsuit that may set boundaries for developing automated chatbots.

The agreement arises from a series of lawsuits filed by prominent authors, including Sarah Silverman, Paul Tremblay, and Ta-Nehisi Coates, who accuse OpenAI of collecting large numbers of books from the internet. ChatGPT allegedly used these books to generate responses that infringe on their copyrights. The development follows a court’s decision in July to dismiss a claim that the company engaged in unfair business practices by using content without permission or compensation. Earlier, U.S. District Judge Araceli Martínez-Olguín had dismissed other claims of negligence, unjust enrichment, and vicarious copyright infringement. The authors’ direct copyright infringement claim, however, was left intact.

AI companies have often pushed back when accusations of wholesale copying arise. Instead, they emphasize that their models are developed by establishing parameters based on existing works to learn what things look like and how they should be structured. In this ongoing case, OpenAI might eventually make this argument, along with the claim that using published works for training purposes qualifies as fair use. This legal doctrine protects the use of copyrighted material to create a new work, as long as the use is transformative.

OpenAI has said that it trains its model using “extensive, publicly accessible datasets containing copyrighted materials.” Last year, it began keeping the specific sources under wraps, aiming to maintain an edge over competitors and avoid potential legal exposure. Although it is not known exactly which works were used, authors have noticed that ChatGPT is adept at summarizing their novels and analyzing their themes in depth. They suggest that the company may have downloaded hundreds of thousands of books from shadow library sites to train its AI system.

Under the terms of the agreement, the training datasets will be made available at OpenAI’s San Francisco office on a secure computer that is not connected to the internet or any network. Anyone who wants to access this information must sign a confidentiality agreement, sign a visitor’s log, and present proper identification.

Technology use in the inspection room will be heavily restricted. Devices such as computers, mobile phones, or cameras will not be permitted. OpenAI may allow use of a computer only for note-taking, but lawyers representing the authors must manually transcribe these notes onto another device at the end of each day, under the supervision of company representatives. No copies of any part of the training data will be allowed.

According to the document, the team conducting the inspection is allowed to jot down their observations using either handwritten or digital notes on the supplied computer, but they must refrain from directly copying the training data into these notes.

Lawyers from the Joseph Saveri Law Firm are leading the cases. They also represent authors in a similar copyright dispute against Meta. In that case, fact discovery is scheduled to close on September 30, though an extension request has been submitted. During a hearing last Friday, U.S. District Judge Vince Chhabria raised doubts about whether the attorneys are equipped to effectively represent the authors.

“It’s quite apparent from the documents, the court record, and discussions with the magistrate judge that you’ve presented this case without making significant progress in its development,” Chhabria said, as reported by Politico. “Your team and you have primarily failed to engage in the litigation process. This is evident… This isn’t a typical proposed class action; it’s a crucial case dealing with an essential societal issue. It matters greatly for your clients.”

The concern stemmed in part from the lawyers’ failure to conduct any depositions in the case.

“Sometimes they say timing is everything, and it seems they were right, even for poor timing,” Judge Thomas Hixson wrote. The plaintiffs asked the court to let them take 35 party depositions, excluding third parties, or a total of 180 hours of testimony. They made this request just 18 days before the close of fact discovery.

The judge stated, “Given that the Plaintiffs haven’t conducted any depositions at all, scheduling all 35 party depositions (alongside non-party depositions), or equivalently, the 180 hours of deposition testimony, within the second half of September is clearly unfeasible.”

2024-09-25 02:24