Diving into Data: Managing the Legal Risks of a Data-Driven Economy
“Data is the new oil,” a phrase coined by Clive Humby in 2006, captures the immense value data holds in today’s economy. As data brokers and big data analytics show, data has created its own thriving market while also driving other markets. Because vast amounts of data can now be licensed and applied through AI, data is increasingly central to innovation and decision-making. Businesses must navigate this period of accelerating change within a complex and evolving legal landscape. Below are key considerations when Diving into Data.
A. Key Considerations When Creating Your Own Dataset
As the development of machine learning systems grows, so does the demand for quality datasets to train and evaluate those systems. A pivotal question for the dataset creator is how to curate the data responsibly. Attention is too often focused on the model and its algorithmic performance, but if there are problems with the foundational training data, there will be problems with the resulting model’s performance: garbage in, garbage out.
In curating a quality dataset, creators need to be mindful of several factors, including privacy and ethical considerations.
Privacy Considerations: Any information collected from or about people deserves the highest level of scrutiny. Personal data is a broad category of information that covers more than an individual’s name or Social Security number. It can include indirect identifiers that can reveal an individual’s identity, ranging from hair color to job title to location data. If the data collected will include personal data, the data collector must secure each individual’s informed consent before using the data. This requires the data collector to be transparent about how the data will be collected, used, and stored. Similarly, the data collector needs to be aware that certain categories of data (e.g., health, biometric, financial) are subject to additional privacy regulation under the General Data Protection Regulation (GDPR), HIPAA, and various state laws. One way to protect a data subject’s privacy is to de-identify the dataset.
Ethical Considerations: Data ownership arguably is the foremost principle of data ethics. The data collector needs to respect copyrights and other intellectual property rights before using data. Apart from data ownership, there are several other ethical considerations when assembling a dataset. Even with the best intentions, the manner in which data is collected can lead to a disparate impact. The data collector should strive to collect diverse data that represents various cohorts and conditions to avoid bias in AI systems. This may include implementing an ethics review/approval process to identify and mitigate biases in the data, to ensure that the dataset does not perpetuate societal biases. While these measures may require additional time and resources, the upfront investment can potentially avoid the more costly task of having to dismantle your model or algorithm and start over.
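The de-identification step mentioned under Privacy Considerations can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a legally sufficient method: the field names (name, ssn, user_id, zip_code) and the helper functions are hypothetical, and the approach shown (dropping direct identifiers, replacing an ID with a keyed hash, and truncating a location field) is only one of several possible techniques.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "ssn"}   # removed entirely
PSEUDONYMIZED_FIELDS = {"user_id"}     # replaced with a keyed hash
TRUNCATED_FIELDS = {"zip_code": 3}     # keep only the first N characters


def pseudonym(value: str, key: bytes) -> str:
    """Return a keyed hash so the same input always maps to the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def deidentify(record: dict, key: bytes) -> dict:
    """Return a copy of the record with identifying fields removed or coarsened."""
    cleaned = {}
    for field_name, value in record.items():
        if field_name in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers
        if field_name in PSEUDONYMIZED_FIELDS:
            cleaned[field_name] = pseudonym(str(value), key)
        elif field_name in TRUNCATED_FIELDS:
            keep = TRUNCATED_FIELDS[field_name]
            cleaned[field_name] = str(value)[:keep] + "**"
        else:
            cleaned[field_name] = value  # left as-is; may still be an indirect identifier
    return cleaned


record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "user_id": 42,
    "zip_code": "90210",
    "job_title": "Engineer",
}
cleaned_record = deidentify(record, key=b"example-secret-key")
```

Fields that pass through unchanged, such as job title, are the kind of indirect identifiers described above and may still allow re-identification when combined. A keyed hash is a form of pseudonymization, and pseudonymized data generally remains personal data under the GDPR, so whether any given approach meets a legal de-identification standard is a question for counsel rather than for the code.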
B. Key Considerations When Purchasing/Licensing Proprietary Data
Use of datasets from a third party requires a license. Datasets may be protected by IP or privacy law. While there may be similarities to other IP licenses and overlap between datasets, each has distinct features that warrant close review. Below are key considerations in agreements involving data generally:
Ownership: Clarify who owns the derived data created from the licensed data (AI output) to meet business needs and reduce the risk of disputes.
Scope: A licensee should have a clear understanding of the business needs driving the data license. Because licenses include restrictions, such as limits on sublicensing, on the creation of derivative works, and on geographic use, it is critical that the scope of the license aligns with those needs, whether customer profiling, campaign optimization, or product development. Be mindful of any flow-down requirements.
Warranties: Not all data is created equal, or legally acquired. Require the licensor to warrant that it has the legal right to license the data and that such rights will not be revoked. This is especially important for privacy, because some categories of personal information, such as sensitive personal information, require consent.
Liability Allocation: Cap licensee liabilities and include indemnification for IP and privacy violations.
Survival or Effects of Termination: Ensure the right to continue using data already received or derived (and any models or products updated or modified by said data) after termination.
Cybersecurity: Maintain appropriate physical, administrative, and technical safeguards for the type of data being used, including a data breach response plan. Consider purchasing cybersecurity insurance for coverage in the event of a breach involving the licensed data.
Privacy Law Compliance: Ensure compliance with applicable privacy laws, such as the California Consumer Privacy Act and GDPR.
Outsourcing datasets can transform your business – but be sure you are aware of the risks and have proper guardrails in place.
C. Key Considerations When Using Public Data
Responsible use of public datasets can help accelerate innovation. Organizations should treat public data with the same diligence as proprietary data, ensuring legal compliance, privacy compliance, and ethical integrity. A proactive approach to governance and risk management is essential to unlocking the full value of public data while protecting your organization. When using public datasets for model training, there are several key considerations:
Privacy Issues: Evaluate whether the datasets include any personal information, as this could implicate privacy laws and require specific consents or licenses. Moreover, under the GDPR, the use of public datasets containing personal data may require the completion of a Data Protection Impact Assessment.
Source and License Terms of Data: The source of the data is also important. Publicly available web data may be subject to licenses and website terms of use, and using such data for training could lead to breach of contract and/or infringement claims and affect an organization’s rights to AI outputs. Review the dataset card, if available. Defenses such as fair use may be available in certain jurisdictions.
Data Quality and Ethical Issues: Implement or review processes for ensuring the quality and integrity of the training data to mitigate risks associated with biased, incomplete, or inaccurate data.
Governance, Documentation and Other Risk Mitigation Strategies: Establish clear internal processes for vetting public datasets before use (e.g., implement a review process for all external datasets that includes a review of licensing, privacy and ethics). Obtain warranties, covenants or other protections concerning the developer’s or vendor’s right to obtain and use training data for AI development. Negotiate indemnification coverage for potential privacy or infringement claims arising from use of AI models or AI outputs.
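The review process described under Governance, Documentation and Other Risk Mitigation Strategies could be tracked with a simple record for each external dataset. The Python sketch below is one hypothetical way to do so; the DatasetReview class, its field names, and the example dataset name are assumptions made for illustration, and the three checks mirror the licensing, privacy, and ethics reviews named above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetReview:
    """Hypothetical record of the internal vetting of one external dataset."""
    name: str
    licensing_reviewed: bool = False
    privacy_reviewed: bool = False
    ethics_reviewed: bool = False
    notes: list[str] = field(default_factory=list)

    def outstanding(self) -> list[str]:
        """List the reviews that have not yet been completed."""
        checks = {
            "licensing": self.licensing_reviewed,
            "privacy": self.privacy_reviewed,
            "ethics": self.ethics_reviewed,
        }
        return [label for label, done in checks.items() if not done]

    def approved(self) -> bool:
        """A dataset is cleared for use only when every review is complete."""
        return not self.outstanding()


review = DatasetReview(name="example-public-dataset")
review.licensing_reviewed = True
review.notes.append("License terms checked against intended training use.")

pending = review.outstanding()   # reviews still required before use
cleared = review.approved()      # False until all three reviews are done
```

A record like this does not replace legal review; it only makes it harder for a dataset to reach model training before each review has been signed off.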