Data (Use and Access) Bill [HL] Debate
Lord Tarassenko (Crossbench - Life peer)
Department for Business and Trade
(5 months, 1 week ago)
Grand Committee

My Lords, I will speak briefly in support of this amendment. Anyone who has written computer code, and I plead guilty, knows that large software systems are never bug-free. These bugs can arise because of software design errors, human errors in coding or unexpected software interactions for some input data. Every computer scientist or software engineer will readily acknowledge that computer systems have a latent propensity to function incorrectly.
As the noble Baroness, Lady Kidron, has already said, we all regularly experience the phenomenon of bug fixing when we download updates to software products in everyday use—for example, Office 365. These updates include not only new features but patches to fix bugs which have become apparent only in the current version of the software. The legal presumption of the proper functioning of “mechanical instruments” that courts in England and Wales have been applying to computers since 1999 has been shown by the Post Office Horizon IT inquiry to be deeply flawed. The more complex the program, the more likely the occurrences of incorrect functioning, even with modular design. The program at the heart of Fujitsu’s Horizon IT system had tens of millions of lines of code.
The unwillingness of the courts to accept that the Horizon IT system developed for the Post Office was unreliable and lacking in robustness—until the key judgment, which has already been mentioned, by Mr Justice Fraser in 2019—is one of the main reasons why more than 900 sub-postmasters were wrongly prosecuted. The error logs of any computer system make it possible to identify unexpected states in the computer software and hence erroneous system behaviour. Error logs for the Horizon IT system were disclosed only in response to a direction from the court in early 2019. At that point, the records from Fujitsu’s browser-based incident management system revealed 218,000 different error records for the Horizon system.
For 18 years prior to 2019, the Post Office did not disclose any error log data, documents which are routinely maintained and kept for any computer system of any size and complexity. Existing disclosure arrangements in legal proceedings do not work effectively for computer software, and this amendment concerning the electronic evidence produced by or derived from a computer system seeks to address this issue. The Post Office Horizon IT inquiry finished hearing evidence yesterday, having catalogued a human tragedy of unparalleled scale, one of the most widespread miscarriages of justice in the UK. Whether it is by means of this amendment or otherwise, wrongful prosecutions on the basis that computers always operate properly cannot continue any longer.
My Lords, if I may just interject, I have seen this happen not just in the Horizon scandal. Several years ago, the banks were saying that you could not possibly find out someone’s PIN and were therefore refusing to refund people who had had stuff stolen from them. It was not until the late Professor Ross Anderson, of the computer science department at Cambridge University, proved that the banks had been deliberately misidentifying to the courts which counters were being read, and explained exactly how the system could be made to default back to a different set of counters, that the banks eventually had to give way. But they went on lying to the courts for a long time. I am afraid that this is something that keeps happening again and again, and an amendment like this is essential for future justice for innocent people.
My Lords, before we proceed, I draw to the attention of the Committee that we have a hard stop at 8.45 pm and we have committed to try to finish the Bill this evening. Could noble Lords please speak quickly and, if possible, concisely?
My Lords, I support my noble friend Lady Kidron’s Amendment 211, to which I have put my name. I speak not as a technophobe but as a card-carrying technophile. I declare an interest as, for the past 15 years, I have been involved in the development of algorithms to analyse NHS data, mostly from acute NHS trusts. This is possible under current regulations, because all the research projects have received medical research ethics approval, and I hold an honorary contract with the local NHS trust.
This amendment is, in effect, designed to scale up existing provisions and make sure that they are applied to public sector data sources such as NHS data. By classifying such data as sovereign data assets, it would be possible to make it available not only to individual researchers but to industry—UK-based SMEs and pharmaceutical and big tech companies—under controlled conditions. One of these conditions, as indicated by proposed new subsection (6), is to require a business model where income is generated for the relevant UK government department from access fees paid by authorised licence holders. Each government department should ensure that the public sector data it transfers to the national data library is classified as a sovereign data asset, which can then be accessed securely through APIs acting
“as bridges between each sovereign data asset and the client software of the authorized licence holders”.
In the time available, I will consider the Department of Health and Social Care. The report of the Sudlow review, Uniting the UK’s Health Data: A Huge Opportunity for Society, published last month, sets out what could be achieved through linking multiple NHS data sources. The Academy of Medical Sciences has fully endorsed the report:
“The Sudlow recommendations can make the UK’s health data a truly national asset, improving both patient care and driving economic development”.
There is little difference, if any, between health data being “a truly national asset” and “a sovereign asset”.
Generative AI has the potential to extract clinical value from linked datasets in the various secure data environments within the NHS and to deliver a step change in patient care. It also has the potential to deliver economic value, as the application of AI models to these rich, multimodal datasets will lead to innovative software products being developed for early diagnosis and personalised treatment.
However, it seems that the rush to generate economic value is preceding the establishment of a transparent licensing system, as in proposed new subsection (3), and the setting up of a coherent business model, as in proposed new subsection (6). As my noble friend Lady Kidron pointed out, the provisions in this amendment are urgently needed, especially as the chief data and analytics officer at NHS England is reported as having said, at a recent event organised by the Health Service Journal and IBM, that the national federated data platform will soon be used to train different types of AI model. The two models mentioned in the speech were OpenAI’s proprietary ChatGPT model and Google’s medical AI, which is based on its proprietary large language model, Gemini. So, the patient data in the national federated data platform being built by Palantir, which is a US company, is, in effect, being made available to fine-tune large language models pretrained by OpenAI and Google—two big US tech companies.
As a recent editorial in the British Medical Journal argued:
“This risks leaving the NHS vulnerable to exploitation by private technology companies whose offers to ‘assist’ with infrastructure development could result in loss of control over valuable public assets”.
It is vital for the health of the UK public sector that there is no loss of control resulting from premature agreements with big tech companies. These US companies seek privileged access to highly valuable assets which consist of personal data collected from UK citizens. The Government must, as a high priority, determine the rules for access to these sovereign data assets along the lines outlined in this amendment. I urge the Minister to take on board both the aims and the practicalities of this amendment before any damaging loss of control.
My Lords, I support Amendment 211 moved by my noble friend Lady Kidron, which builds on earlier contributions in this place made by the noble Lords, Lord Mitchell, Lord Stevenson, Lord Clement-Jones, and myself, as long ago as 2018, about the need to maximise the social, economic and environmental value that may be derived from personal data of national significance and, in particular, data controlled by our NHS.
The proposed definition of “sovereign data assets” is, in some sense, broad. However, the intent to recognise, protect and maximise their value in the public interest is readily inferred. The call for a transparent licensing regime to provide access to such assets and the mention of preferential access for individuals and organisations headquartered in the UK also make good sense, as the overarching aim is to build and maintain public trust in third-party data usage.
Crucially, I fully support provisions that would require the Secretary of State to report on the value and anticipated financial return from sovereign data assets. Identifying a public body that considered itself able or willing to guarantee value for money proved challenging when this topic was last explored. For too long, past Governments have dithered and delayed over the introduction of provisions that explicitly recognise the need to account for and safeguard the investment made by taxpayers in data held by public and arm’s-length institutions and associated data infrastructure—something that we do as a matter of course where the tangible assets that the National Audit Office monitors and reports on are concerned.
In recent weeks, the Chancellor of the Exchequer has emphasised the importance of recovering public funds “lost” during the Covid-19 pandemic. Yet this focus raises important questions about other potential revenue streams that were overlooked, particularly regarding NHS data assets. In 2019, Ernst & Young estimated that a curated NHS dataset could generate up to £5 billion annually for the UK while also delivering £4.6 billion in yearly patient benefits through improved data infrastructure. This raises the question: who is tracking whether these substantial economic and healthcare opportunities are being realised? Who is ensuring that these projected benefits—both financial and clinical—are actually flowing back into our healthcare system?
As we enter the age of AI, public discourse often fixates on potential risks while overlooking a crucial opportunity—namely, the rapidly increasing value of publicly controlled data and its potential to drive innovation and insights. This raises two crucial questions. First, how might we capitalise on the upside of this technological revolution to maximise the benefits on behalf of the public? Secondly, and more specifically, how will Parliament effectively scrutinise any eventual trade deal entered into with, for example, the United States of America, which might focus on a more limited digital chapter, in the absence of either an accepted valuation methodology or a transparent licensing system for use in providing access to valuable UK data assets?
Will the public, faced with a significant tax burden to improve public services and repeated reminders of the potential for data and technology to transform our NHS, trust the Government if they enable valuable digital assets to be stripped today only to be turned tomorrow into cutting-edge treatments that we can ill afford to purchase and that benefit companies paying taxes overseas? To my mind, there remains a very real risk that the UK, as my noble friend Lady Kidron, rightly stated, will inadvertently give away potentially valuable digital assets without there being appropriate safeguards in place. I therefore welcome the intent of Amendment 211 to put that right in the public interest.
Data (Use and Access) Bill [HL] Debate
Lord Tarassenko (Crossbench - Life peer)
Department for Science, Innovation & Technology
(4 months ago)
Lords Chamber

My Lords, I speak in support of the noble Baroness, Lady Kidron, on Amendment 58, to which I have also put my name. Given the time, I will speak only about NHS datasets.
There have been three important developments since the Committee stage of this Bill in mid-December: the 43rd annual J P Morgan healthcare conference in San Francisco in mid-January, the launch of the AI Opportunities Action Plan by the Prime Minister on Monday 13 January and the announcement of the Stargate project in the White House the day after President Trump’s inauguration.
Taking these in reverse chronological order, it is not clear exactly how the Stargate project will be funded, but several US big tech companies and SoftBank have pledged tens of billions of dollars. At least $100 billion will be available to build the infrastructure for next-generation AI, and this may even rise to $500 billion over the next four years.
The UK cannot match these sums. The AI Opportunities Action Plan instead lays out how the UK can compete by using its own advantages: a long track record of world-leading AI research in our universities and some unique, hugely valuable datasets.
At the JP Morgan conference in San Francisco, senior NHS management had more than 40 meetings with AI companies. These companies all wanted to know one thing: how and when they could access NHS datasets.
It is not surprising, therefore, that it was reported in November that the national federated data platform would soon be used to train different types of AI models. The two models mentioned were OpenAI’s proprietary ChatGPT and Google’s medical AI, Med-Gemini, based on Google’s proprietary large language model, Gemini. Presumably, these models will be fine-tuned using the data stored in the federated data platform.
Amendment 58 is not about restricting access to UK datasets by OpenAI, Google or any other US big tech company. Instead, it seeks to maximise their long-term value, driven by strategic goals rather than short-term, opportunistic gains. By classifying valuable public sector datasets as sovereign data assets, we can ensure that the data is made available under controlled conditions, not only to public sector employees and researchers but to industry, including US big tech companies.
We should expect a financial return when industry is given access to a sovereign dataset. A first condition is a business model such that income is generated for the relevant public body, in this case the NHS, from the access fees paid by the companies that will be the authorised licence holders.
A second condition is signposted in the AI Opportunities Action Plan, whose recommendations have all been accepted by the Government. In the third section of the action plan, “Secure our future with homegrown AI”, Matt Clifford, the author of the plan, writes that
“we must be an AI maker, not just an AI taker: we need companies … that will be our UK national champions … Generating national champions will require a more activist approach”.
Part of this activist approach should be to give companies and organisations headquartered in the UK preferential terms of access to our sovereign data assets.
These datasets already exist in the NHS as minimum viable products, so we cannot afford to delay. AI companies are keen to access data in the federated data platform, which is NHS England’s responsibility, or in the secure data environments set up by the National Institute for Health and Care Research, NIHR.
I urge the Government to accept the principles of this amendment as they will provide the framework needed now to support NHS England and NIHR in their negotiations with AI companies.
I have signed Amendment 58. I also support the other amendment spoken to by the noble Baroness, although I did not get around to signing it. They both speak to the same questions, some of which have been touched on by both previous speakers.
My route into this was perhaps a little less analytic. I used to worry about the witty comment that many people made, that data was the new oil, without really thinking about what that meant or what it could mean. It began to settle in my mind that, if data is indeed an asset, why is it not carried on balance sheets? Why does data held by companies, or even by the Government, not feature in some sort of valuation? Just like oil held by a company or privately, it will eventually be used in some way; that releases revenue, and there will be an accounting treatment for it. But as an accountant I have never seen a company’s balance sheet that put a value on its data. That is where I came from.
A sovereign data approach, which labels assets of value to the economy held by the country rather than a company, seems to be a way of trying to get into language what is more of an accounting approach than perhaps we need to spend time on in this debate. The noble Baroness, Lady Kidron, has gone through the amendment in a way that explains the process, the protection and the idea that it should be valued regularly and able to account for any returns it makes. We have also heard about the way it features in other publications.
I want to take a slightly different part of the AI Opportunities Action Plan, which talks about data and states:
“We should seek to responsibly unlock both public and private data sets to enable innovation by UK startups and researchers and to attract international talent and capital. As part of this, government needs to develop a more sophisticated understanding of the value of the data it holds, how this value can be responsibly realised, and how to ensure the preservation of public trust across all its work to unlock its data assets”.
These are very wise words.
I end by saying that I was very struck by the figures released recently about the number of people who opted out of the NHS’s data collection. I think there are Members present who may well be guilty of such a process. I of course am happy to have my data used in a way that will provide benefit, but I do recognise the risks if it is not properly documented and if people are not aware of what they are giving up or offering in return for the value that will be extracted from it.
I am sure we all want more research and better research. We want research that will yield results. We also want value and to be sure that the data we have given up, which is held on our behalf by various agencies, is properly managed. These amendments seem to provide a way forward and I recommend them.
Data (Use and Access) Bill [HL] Debate
Lord Tarassenko (Crossbench - Life peer)
Department for Business and Trade
(2 weeks, 4 days ago)
Lords Chamber

My Lords, I sat through the Committee stage and did not speak, because I was so clear that the amendment tabled by the noble Viscount, Lord Colville, was really inappropriate. Indeed, it was the speech by my noble friend Lord Vallance that made me feel that speaking was unnecessary. I regret that, because I should have pointed out something very important.
First, to come back to what the noble Baroness has just said, the definitions in the Bill are flawed to start with. Devising new applications of available knowledge is not the work of scientists; it is essentially the work of technologists—and there is a big difference between technology and science. Technology has all sorts of downsides that we do not expect. Science is simply knowledge and, as such, it does not have an ethical dimension. That has been stated by many distinguished philosophers in the past.
I suggest to noble Lords that every invention and innovation that we make has a downside that we do not expect. I would challenge the noble Viscount, Lord Colville. I have sat on a scientific committee with him and we have enjoyed each other’s company, but I would suggest that there is not a single case where he could argue that there has not been an unexpected downside—as there is with every scientific invention. That, essentially, is why the amendment rests on a flawed premise.
There are three problems facing humanity and the world at the moment. One, clearly, is nuclear war; one, clearly, is climate change; and perhaps the most important one is the risk of antibiotic resistance. It turns out that all these risks were completely unseen when those technologies were started. Marie Curie, working on pitchblende and radium, had no idea that the X-rays she was producing would be dangerous to her and would, unfortunately, lead to her death from cancer. She certainly did not recognise that there might be such a thing as nuclear explosions.
When it comes to, for example, climate change, it is obvious that we would not want to avoid having had the Industrial Revolution and all the things we got from it, just as we would not want to avoid having X-rays. But we must understand that the control has to come later.
Lastly, the most important thing for me is the possibility that infection is one of the greatest problems that humanity faces on earth at the moment. The invention of antibiotics has actually made that worse, because we now have a rising number of infections that are due to resistance. Therefore, I beg the noble Viscount not to press his amendment. I believe that it is absolutely well-meaning, and I understand what he is trying to say, but I hope that the Opposition, when they listen to this, will consider at least abstaining from voting on it, because the risk is that it might bring the House of Lords into disrepute as it stands.
My Lords, I am authorised to speak on Motion 43A, as someone with regular day-to-day experience of scientific research. Since I started my PhD in 1981, I have had the privilege of spending more than half my working life doing scientific research in the UK—the last 20 years working with very sensitive patient data. Most of that research has been carried out in an academic setting, but some of it has been in collaboration with medtech, AI and pharmaceutical companies.
This research has required me to become familiar with many three-letter and four-letter acronyms. Noble Lords will know about DBS, but they might not know about RSO, TRO, HRA, LREC, MREC, CAG, and IRAS, to name just a few. I have spent hundreds of hours working with clinical colleagues to fill in integrated research application system—IRAS—forms. IRAS is used to apply for Health Research Authority—HRA—approval for research projects involving the NHS, social care or the criminal justice system. I have appeared before not only medical research ethics committees, or MRECs, which test whether a research protocol is scientifically valid and ethical, but local research ethics committees, or LRECs, which consider the suitability of individual researchers and local issues.
I was involved in a research project which reused data acquired from patients on a Covid isolation ward during the first two waves of the pandemic. That research project sought to understand how nurses interpreted continuous data from the clinical-grade wearables we used to monitor these high-risk patients during Covid. It took our research team more than 18 months to obtain the relevant permissions to reuse the data for our proposed analysis. Our application was reviewed by the Confidentiality Advisory Group—CAG—which provides independent expert advice on the use of confidential patient information without consent for research and non-research purposes. CAG already considers whether accessing the confidential data is justified by the public interest. Its advice is then used by the HRA and the Secretary of State for Health and Social Care to decide whether to grant access to the confidential data.
The existing provisions in this country to allow access to data for research purposes are stringent, and it is entirely right that they should be. The UK is respected the world over for the checks and balances of its research governance. The relevant safeguards already exist in the current legislation. Adding a further public interest test will only increase the amount of bureaucracy that will inevitably be introduced by the research services offices, or RSOs, and the translational research offices, or TROs, of our universities, which are very good at doing this.
The extra burden will fall on the researchers themselves, and some researchers may decide to concentrate their available time and energy elsewhere. This amendment, I am afraid, will have the unintended consequence of having a negative impact on research in this country, so I cannot support it.
My Lords, an onlooker might be forgiven for not perceiving a common theme in this group of amendments, but I thank the Minister for his introduction and the noble Viscounts for introducing their amendments so clearly.
I acknowledge that Motion 32A and Amendments 32B and 32C and Motion 52A and Amendments 52B and 52C from the noble Viscount, Lord Camrose, are considerably less prescriptive than the Spencer amendment in the House of Commons to introduce new Clause 21, which seemed to require public authorities to comb through every record to rectify data, went significantly further than the findings of the Supreme Court judgment, and potentially failed to account for the privacy afforded to GRC holders under the Gender Recognition Act. However, the Liberal Democrats will abstain from votes on the noble Viscount’s amendments for several key reasons.
Our primary reason is the need to allow time for the EHRC’s guidance to be finalised. I thought the Minister made his case there. The EHRC is currently updating its code of practice, as we have heard, to reflect the implications of the Supreme Court judgment on the meaning of sex in the Equality Act, with the aim of providing it to the Government by the end of June. This guidance, as I understand it, is intended specifically to support service providers, public bodies and others in understanding their duties under the Equality Act and putting them into practice in the light of the judgment. The EHRC is undertaking a public consultation to understand how the practical implications can best be reflected. These amendments, in our view, are an attempt to jump the gun on, second-guess or at the least pre-empt the EHRC’s code of practice.
On these Benches, we believe that any necessary changes or clarifications regarding data standards should be informed by the official guidance and implemented consistently in a coherent and workable manner. We should allow time for the EHRC’s guidance to be finalised, ensuring that any necessary changes or clarifications regarding data standards are informed by its advice and implemented consistently across public authorities in a coherent and workable manner. We have concerns about workability and clarity. Although the amendments proposed by the noble Viscount, Lord Camrose, are less prescriptive than previous similar proposals in the Commons tabled by Dr Spencer, we have concerns about their practical implementation. Questions arise about how public authorities would reliably ascertain biological sex if someone has a gender recognition certificate and has updated their birth certificate. I have long supported same-sex wards in the NHS, but I do not believe that these amendments are helpful in pursuing clarity following the Supreme Court judgment. We heard what the Minister had to say about passports.
I welcome the clarity provided by the Supreme Court judgment, but there are clearly implications, both practical and legal, to be worked out, such as those mentioned by the noble Viscount, Lord Hailsham. I thought he put his finger on many of those issues. I trust that the EHRC will deliver the right result. I agree that data needs to be accurate, and I welcome the Sullivan report, as did my noble friend. In summary, we will be abstaining. We believe that the EHRC process needs to conclude and provide comprehensive guidance, while also reflecting concerns about the workability and appropriateness of specific legislative interventions on data standards at this time.
I move on to Amendment 43B, tabled by the noble Viscount, Lord Colville. This amendment may not reinstate the precise wording
“conducted in the public interest”
that we previously inserted in this House, but it would introduce safeguards that seek to address the same fundamental concerns articulated during our debate on Report. It does two important things.
First, it provides a definition of “scientific research”, clarifying it as
“creative and systematic work undertaken in order to increase the stock of knowledge”.
This directly addresses the concerns raised on Report that the line between product development and scientific research is often blurred, with developers sometimes positing efforts to increase model capabilities or study risks as scientific research. Having a clear definition helps to distinguish genuine research from purely commercial activity cloaked as such.
Secondly, and critically, Amendment 43B would require:
“To meet the reasonableness test”
already present in the Bill,
“the activity being described as scientific research must be conducted according to appropriate ethical, legal and professional frameworks, obligations and standards”.
This requirement seeks to embed within the reasonableness test the principles that underpinned our arguments for the public interest requirement on Report and is the same as the amendment put forward by the chair of the Science, Innovation and Technology Select Committee, Chi Onwurah MP, which ties the definition to the definition in the OECD’s Frascati Manual: Guidelines for Collecting and Reporting Data on Research and Experimental Development:
“creative and systematic work undertaken in order to increase the stock of knowledge—including knowledge of humankind, culture and society—and to devise new applications of available knowledge”.
The Frascati framework is used worldwide by Governments, universities and research institutions to report R&D statistics, inform science policy and underpin R&D tax credit regimes, and it serves as a common language and reference point for international comparisons and policy decisions related to scientific research and innovation. These frameworks, obligations and standards are important because they serve the very purposes we previously identified for the public interest test: ensuring societal benefit, building public trust, preventing misuse for commercial ends, addressing harmful applications, and alignment with standards.
Amendment 43B in the name of the noble Viscount, Lord Colville, is a thoughtful and necessary counter-proposal. It is Parliament’s opportunity to insist that the principles of public benefit, trust and responsible conduct, rooted in established frameworks, must remain central to the definition of scientific research that benefits from data re-use exceptions.
I heard what the noble Lord, Lord Winston, had to say in his very powerful speech, but I cannot see how the amendment from the noble Viscount, Lord Colville, cuts across all the things that he wants to see in the outcomes of research.
My Lords, I rise to speak as the founder of two AI spin-outs, and I draw the House’s attention to my registered interests as the founder-director of Oxehealth, a University of Oxford spin-out that uses AI for healthcare applications. I am also the author of three copyrighted books.
Since these amendments were last debated in the House of Lords, there has been a lot of high-profile comment but very few attempts, if any, to bring AI developers and creators together in the same room. During the same period, however, more businesses from the creative industries and the publishing sector have agreed content-licensing deals. That is because access to curated, high-quality content to fine-tune large language models—the step after pre-training which provides high-accuracy responses—is increasingly being monetised.
Even the Guardian Media Group, a strong supporter of the creative industries, announced in February a strategic partnership with OpenAI to ensure compensation for the use of its high-quality journalism. This shows that it is possible, without any change in the law, for the creative industries and the big tech companies to come to licensing agreements.
The main technological development since our last debate has been the demonstration that training LLMs no longer requires the massive compute facilities and huge data centres of the big tech companies in the US. Since the beginning of the year, the Chinese company DeepSeek has released open-source LLMs roughly a hundred times smaller than hyperscale models such as GPT-4, Gemini or Claude Sonnet. These models, typically with, say, 10 billion weights, have been developed through the process of distillation, and they achieve almost the same level of performance as the hyperscale models with 1 trillion weights.
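In outline, distillation trains a small "student" model to reproduce the output distribution of a large "teacher" model, rather than learning from raw data alone. A minimal sketch of the standard distillation loss follows; all names are illustrative, and this is not any particular lab's implementation:

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw logits to a probability distribution.
    # A higher temperature "softens" the distribution, exposing more of
    # the teacher's knowledge about relative likelihoods.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the teacher's softened distribution and the
    # student's: minimising this trains the student to mimic the teacher.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher exactly incurs zero loss;
# a mismatched student incurs a positive loss.
teacher = [2.0, 1.0, 0.1]
assert abs(distillation_loss(teacher, teacher)) < 1e-12
assert distillation_loss(teacher, [0.1, 1.0, 2.0]) > 0
```

Because the student needs only a fraction of the teacher's weights to approximate its behaviour, the resulting models are small enough to train and run on commodity hardware.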
Why is that important? It means that users of LLMs no longer have to send queries to those hyperscale models which are then processed by OpenAI, Google or Anthropic using their huge compute facilities with thousands of GPUs in their data centres. Instead, any AI developer can now train and run distilled versions of those LLMs locally on their laptops.
DeepSeek was the first AI company to show how powerful the process of distillation is in the context of LLMs. Other big tech companies are now jumping on the bandwagon. In early March, Google released a brand-new LLM called Gemma 3, a lightweight, state-of-the-art open-source model that can be run anywhere from smartphones to laptops, and has the ability to handle text, images, and short videos.
These open-source distilled LLMs are now being used by thousands of AI developers, in the UK and elsewhere, who are training and fine-tuning them using content publicly available on the web, some of which may be copyrighted. Training an LLM on a laptop using data from the open web will become as commonplace as searching the web. This is already happening both within computer science departments in UK universities and in the rich ecosystem of AI start-ups and university spin-outs in the UK.
A survey of 500 developers and investors in the UK AI ecosystem, carried out by JL Partners last month, found that 94% of respondents relied on AI models built using publicly available data from the web, and that 66% would move projects to other countries if UK data laws were more restrictive than those elsewhere. We need to consider the impact on the UK's AI industry of these transparency provisions, and of the requirement to provide copyright owners with information regarding the text and data used in the pre-training, training and fine-tuning of general-purpose AI.
The use of content from behind paywalls or from pirated databases such as Books3 or LibGen, which is known to have been done by Meta to train its LLM, is clearly illegal. However, for data publicly available on the open web, I would like to do a simple thought experiment to show that the transparency requirements in Motion 49A are at present unworkable. In the UK, unlike in the US, there is no copyright database. Usually, the copyright rests with the author of the work, but there are exceptions, such as when a work is created by an employee in the course of their job, and copyright may also be assigned or transferred to a third party. If we assume, generously, that it might take just one second, on average, to ascertain the copyright status of an article, book, image, or audio or video recording on the web, it would take 31 years and eight months to check the copyright status of the 1 billion data points in a typical LLM training set—never mind thinking about setting up licensing deals with the millions of rights holders. For the distilled models that are now, as I explained, being trained or fine-tuned by UK developers, which are 100 times smaller, the copyright status check would still require one-third of a year—still an entirely unworkable proposition.
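The arithmetic behind these figures is straightforward to verify. The one-second-per-item figure is the speaker's stated assumption; the rest is simple division:

```python
# Seconds in an average (Julian) year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Hyperscale training set: ~1 billion items, at an assumed 1 second
# per copyright-status check.
years_full = 1_000_000_000 / SECONDS_PER_YEAR

# Distilled models use a training set roughly 100 times smaller.
years_distilled = 10_000_000 / SECONDS_PER_YEAR

print(round(years_full, 1))       # roughly 31.7 years, i.e. about 31 years 8 months
print(round(years_distilled, 2))  # roughly 0.32 years, i.e. about a third of a year
```

Even the smaller figure assumes a single unbroken second per item with no time at all for negotiating licences, which is the point of the thought experiment.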