Lord Faulks extracts from Data (Use and Access) Bill [HL] (18th December 2024)

Data (Use and Access) Bill [HL]

Lord Faulks Excerpts

Committee stage
Wednesday 18th December 2024

Grand Committee

Amendment Paper: HL Bill 40-IV Fourth marshalled list for Grand Committee (17 Dec 2024)

We need transparency and a clear statement about copyright along the lines of these amendments. We absolutely should not expect artists to have to opt out. AI developers must be transparent about the identity and purposes of their crawlers and have separate crawlers for distinct purposes. Unless news publishers and the broader creative industries can retain control over their data, making UK copyright law enforceable, AI firms will be free to scrape the web without remunerating creators. This will not only reduce investment in trusted journalism but ultimately harm innovation in the AI sector.

Lord Faulks (Non-Afl)

The noble Lord has enormous experience in these areas and will be particularly aware of the legal difficulties in enforcing rights. Given what he said, with which I entirely agree—indeed, I agree with all the speakers in supporting these amendments—and given the extraordinary expense of litigating to enforce rights, how does he envisage there being an adequate system to allow those who have had their data scraped in the way that he describes to obtain redress or, rather, suitable remedies?

Lord Clement-Jones (LD)

I thank the noble Lord for that. He is anticipating a paragraph in my notes, which says that, although it is not set out in the amendments, robust enforcement of these provisions will be critical to their success. This includes oversight from an expert regulator that is empowered to issue significant penalties, including fines for non-compliance. There is a little extra work to do there, and I would very much like to see the Intellectual Property Office gain some teeth.

I am going to close. We are nearly at the witching hour, but it is clear that AI developers are seeking to use their lobbying clout—the noble Baroness, Lady Kidron, mentioned the Kool-Aid—to persuade the Government that new copyright law is required. Instead, this amendment would clarify that UK copyright law applies to gen AI developers. The creative industries, and noble Lords from across the House as their supporters, will rally around these amendments and vigorously oppose government plans for a new text and data-mining exception.

--- Later in debate ---

Lord Faulks (Non-Afl)

My Lords, I have very little to add because I entirely support all these amendments. I am always concerned when I see the words “lack of clarity” in a context like this. The basic principle of copyright law, whereby one provides a licence and is paid for that licence by agreement, has been well established. There is no need for any further clarity in this context, as in earlier contexts of copyright law.

I should declare an interest as the chairman of IPSO, the regulator of 95% of the printed news media and its online versions. I have been impressed by the News Media Association’s briefings. It has identified important issues. I am extremely concerned about what appears to have been a considerable amount of lobbying by big tech in this area. It reminds me of what took place when your Lordships’ House considered the Digital Markets, Competition and Consumers Bill. A low point for me was when we were told that it would be very difficult to establish a proper system otherwise Google’s human rights would be somehow infringed. It is extremely important that this so-called balance does not mean that those who create original material protected by the copyright Acts have their rights violated in order to satisfy the interests of big tech.

The Earl of Effingham (Con)

My Lords, my noble friend Lord Camrose apologises to the Committee but he has had to leave early for unavoidable family reasons. Needless to say, he will read Hansard carefully.

It is our belief that a society that fails to value products of the mind will never be an innovative society. We are fortunate to live in that innovative society now and we must fight to ensure it remains one. Data scraping and AI crawlers pose both novel and substantial challenges to copyright protection laws and mechanisms. His Majesty’s Official Opposition are pleased that these amendments have been brought forward to address those challenges, which differ from those posed by traditional search engine crawlers.

Generally speaking, in creating laws about data we have been able to follow a north star of replicating online the values and behaviours we take for granted offline. This was of real service to us in the Online Safety Act, for example. In many ways, however, that breaks down when we come to AI and copyright. Offline, we are happy to accept that an artist, author, musician or inventor has been influenced by existing works in their field. Indeed, we sometimes celebrate that fact, and we have a strong intuitive sense of when influence has crossed the line into copying. This means that we can form an intuitive assessment of whether a copyright has been breached offline based on what creators produce, not what content they have consumed, which we expect to be extensive. With an AI crawler, that intuition and model break down. There are simply too many variables and too much information. We have no choice but to go after the inputs.

With that in mind, it would be helpful to set out the differences between traditional search engine crawlers and AI crawlers. Indexing crawlers used by the search engines we are all familiar with store information in their indexes. This then determines the results of the search. However, AI crawlers generally fall into two categories. The training crawlers scrape the web, collecting data used to train large language models. Live retrieval crawlers pull in live data from the web and incorporate it into chatbot responses.

Historically, the robots exclusion protocol—the plain text file identified as robots.txt—has been embedded into website domains, specifying to crawlers what data they can and cannot access in part or all of the domain. This has been used for the past 30 years to protect information or IP from indexing crawlers. Although the robots exclusion protocol has worked relatively well for many years, in some ways it is not fit for the web as it exists today—especially when dealing with AI crawlers.

To exclude crawlers from websites, we must be able to identify them. This was, for the most part, workable in the early days of the internet when there were relatively few search engines and, correspondingly, few indexing crawlers. However, given the rapidly increasing number of AI services, with their corresponding crawlers trawling the web, it becomes impossible to exclude them all. To make matters worse, some AI crawlers operate in relative secrecy. Their names, which can be viewed through domain holder access logs, reveal little of their purpose.

Furthermore, the robots exclusion protocol is not an enforceable agreement; it is more like a polite request. Based on that, a crawler can simply ignore a robots.txt file and scrape the data anyway. It is also worth noting that even if a crawler acknowledges and obeys a robots.txt file, the data may be inadvertently scraped from a third-party source who has lifted the data or intellectual property either manually or using a crawler that does not obey the robots.txt files. That can then be made available without the protection of the robots exclusion protocol. This raises an unsettling question: how do we protect intellectual property and data more generally from these AI crawlers, whose developers decline the voluntary limitations placed on them?
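By way of illustration only, the voluntary nature of the robots exclusion protocol described in the speech can be sketched with Python's standard-library `urllib.robotparser`. This is a minimal sketch under stated assumptions, not a description of how any particular crawler works: the crawler names (`ExampleSearchBot`, `ExampleTrainingBot`, `UnlistedBot`), the `example.org` URLs and the rules themselves are hypothetical and do not come from the debate.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher's site. The crawler names are
# invented for illustration; a site can only name crawlers it can identify.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that a crawler may consult."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
article = "https://example.org/news/article-1"
private_page = "https://example.org/private/draft"

# A well-behaved crawler asks this question before fetching a page.
# Nothing in the protocol stops a crawler from skipping the check.
search_allowed = parser.can_fetch("ExampleSearchBot", article)
training_allowed = parser.can_fetch("ExampleTrainingBot", article)

# A crawler the site has never heard of falls under the "*" rules only.
unlisted_allowed = parser.can_fetch("UnlistedBot", article)
unlisted_private = parser.can_fetch("UnlistedBot", private_page)

print(search_allowed, training_allowed, unlisted_allowed, unlisted_private)
```

In this sketch the named training crawler is refused everywhere, while an unlisted crawler is refused only under `/private/`, which reflects the difficulty of excluding crawlers that cannot be identified. The check runs on the crawler's side, so the file expresses a request and does not enforce anything.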

At this point, I turn to the amendments. Amendment 204 is a great initial step toward requiring crawler operators to respect UK copyright law. However, this provision would apply only to products and services of such operators that are marketed in the United Kingdom. What about those from outside the UK? Indeed, as my noble friend Lord Camrose has often argued, any AI lab that does not want to follow our laws can infringe the same copyright with impunity in another jurisdiction. Unless and until we address the offshoring problem, we continue to have real concerns as to the enforceability of any regulations we implement here.

I will address the individual subsections in Amendment 205. Proposed new subsection (1) would require crawlers to reveal their identity, including their name, who is responsible for them, their purpose, who receives their scraped data, and a point of contact. This is an excellent idea, although we are again concerned about enforceability due to offshoring. Proposed new subsection (2) requires this information to be easily accessible. We are sure this would be beneficial, but our concerns remain about infringements in other jurisdictions.

Requiring the deployment of crawlers with distinct purposes in proposed new subsection (3) is an excellent idea as it would allow data controllers to choose what data can be trawled and for what purpose, to the extent possible using the robots exclusion protocol. We do, however, have concerns about proposed new subsection (4). We are not sure how it would be possible for the exclusion of an AI crawler not to impact the findability of content. We assume this could be achieved only if we mandated the continued use of indexing crawlers.

As for Amendment 206, requiring crawler operators to regularly disclose the information scraped from copyrighted sources and make it accessible to copyright holders on their request is an interesting suggestion. We would be curious to hear how this would work in practice, particularly given the vast scale—some of those models crawl billions of documents, generating trillions of tokens. Where would that data be published? Given the scale of data-scraping, how would copyright holders know where to look for this information? If the operator was based outside the UK, how would disclosure be enforced? Our view is that watermarking technology can come to the rescue, dependent of course on an internationally accepted technical standard for machine-readable watermarks that contain licensing information.