
ChatGPT and Data Leakage: Everything Old is New Again

Data leakage (a form of data exfiltration) is not a new topic in the cybersecurity world.  For as long as there have been humans, there’s been the risk of sensitive information accidentally (or purposely) falling into the wrong hands. 

From company secrets to Personally Identifiable Information (PII), some information is restricted by company policies and other info is restricted by regulatory concerns like HIPAA and GDPR.

Because of the nature of data leakage, and the potential repercussions of having that data move outside of the organization's control, attempts to limit and/or block data leakage have been an ongoing concern for organizations in every vertical and of every size. Cymulate's own Annual Usage Report has found that attempts to limit the exfiltration of data in its many forms have not been successful, and in fact results have gotten worse over the last three years.

Enter into this equation OpenAI's generative artificial intelligence (AI) platform, ChatGPT. ChatGPT is OpenAI's project to create a natural-language human interface for an AI. It can answer complex questions quickly and as accurately as possible, and it learns from each interaction it has with people. While this technology opens the door to revolutionary changes in technology and human-machine interfaces of all kinds, it also means that hundreds of thousands of people are feeding ChatGPT new information every day, and some of that information will be PII and company-confidential data shared with the platform by users who do not realize the implications.

For example, not only can ChatGPT tell someone the average salary for a software engineer in the United States (US$91,000 per year); it can also tell you the average salary for a software engineer at Google (US$141,000 per year). While much of this data is sourced from public information (the two figures noted above were taken from Glassdoor), this is a great example of how information an organization may not wish to be shared ends up within the ChatGPT system and is then given to anyone else with access to it.

While those two examples are relatively benign (sharing salary information in the United States is perfectly legal), think of the potential issues caused by employees asking questions about specific intellectual property, patent details, customer health information, or other sensitive and confidential topics.

The AI will ingest and learn from that information, making it part of its data lake and potentially using it when answering questions posed by totally unrelated users. OpenAI is the most recognized vendor of this type of generative AI system, but it is far from the only organization offering this kind of platform, which means many different data lakes are being filled by millions of queries and sources that may never have been meant to be shared with the world.

While OpenAI does publicly warn users not to share sensitive or confidential information, it is nearly impossible to stop a user from accidentally leveraging the platform in a way that could violate the myriad laws, regulations, and individual corporate policies that apply to its users. Multiple countries have blocked or otherwise prohibited the use of ChatGPT, including Italy and Syria, and industry intelligence suggests others will join this list in the near future. Their concerns range from an inability to keep users from sharing controlled information to OpenAI's difficulties in enforcing regulations like GDPR effectively when users ask questions, leaving the governments of these countries in a gray area when it comes to data control. Industry trends also indicate that the system will be blocked by more and more businesses, and for many of the same reasons.

So, what can an organization do to limit the exposure of controlled data through accidental or purposeful use of ChatGPT in an inappropriate way? 

As with trying to restrict verbal data leakage, the technical options are fairly limited and often difficult and expensive to implement. If the organization uses a firewall or proxy that can filter by domain and/or IP, the most straightforward step is to simply block all known ChatGPT websites while on company networks. While this is not perfect (IP addresses change periodically, and users have personal devices on mobile and other networks), it is a good first step for any organization concerned about this form of data leakage. Similar steps will need to be taken for any other known generative AI services as well, which leads to a potential game of whack-a-mole trying to catch all of them.
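To make that blocking step concrete, here is a minimal sketch of the kind of hostname check an egress filter or proxy plugin could apply. The domain list is illustrative and deliberately incomplete (the real set of generative AI endpoints changes over time), and a production deployment would normally use the filtering features built into the firewall or proxy itself rather than custom code.

```python
# Illustrative sketch only: a hostname blocklist check of the kind an
# egress proxy or filtering script might apply. The domains below are
# example entries, not a complete or authoritative list.

BLOCKED_DOMAINS = {
    "chat.openai.com",
    "chatgpt.com",
    "api.openai.com",
}

def is_blocked(hostname: str) -> bool:
    """Return True if the hostname or any of its parent domains is blocklisted."""
    hostname = hostname.lower().rstrip(".")
    parts = hostname.split(".")
    # Check the full hostname and every parent domain (a.b.c -> b.c -> c).
    for i in range(len(parts)):
        if ".".join(parts[i:]) in BLOCKED_DOMAINS:
            return True
    return False

if __name__ == "__main__":
    for host in ("chat.openai.com", "www.chatgpt.com", "mail.example.com"):
        print(host, "->", "BLOCK" if is_blocked(host) else "allow")
```

The same matching logic extends to other generative AI providers simply by growing the blocklist, which is exactly the whack-a-mole maintenance burden described above.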

Advanced Data Loss Prevention (DLP) and Cloud Access Security Broker (CASB) systems may be of some help as well, though with limits. Since the data in question is effectively just text typed into a chat box, these systems would need to carefully monitor all user communications from within corporate networks to the outside world. This, of course, presents several challenges. Chief among them is the difficulty of monitoring communications over TLS-encrypted (SSL) website sessions, along with the potential privacy issues any time such communications are viewed by the company or any third party. Careful consultation with both technical and legal advisors is vital to ensure everything happens within the limits of technology and the law. If the organization chooses to go down this path, continuous testing with platforms like Cymulate to ensure the systems are working effectively is a necessity, and Cymulate's Advanced Scenarios module already includes simulations for exactly this type of testing.
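As a rough illustration of the content inspection a DLP or CASB might apply once it can see the outbound text (for example, after TLS interception), the sketch below flags a few simple sensitive-data patterns in a chat prompt. The pattern names and regular expressions are assumed examples only; real DLP products use far richer detection such as classifiers, document fingerprinting, and exact data matching.

```python
import re

# Toy illustration of DLP-style content inspection applied to outbound chat
# text. The patterns are simple, assumed examples and would produce both
# false positives and false negatives in practice.
SENSITIVE_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "payment_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "confidential_marking": re.compile(r"\b(company confidential|internal only)\b", re.IGNORECASE),
}

def inspect(text: str) -> list[str]:
    """Return the names of any sensitive patterns found in the outbound text."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]

if __name__ == "__main__":
    prompt = "Summarize this company confidential roadmap for me."
    findings = inspect(prompt)
    if findings:
        print("Blocked outbound message, matched:", findings)
    else:
        print("Message allowed")
```

Even this toy version only works once the traffic is visible in plaintext, which is why the TLS-inspection and privacy questions above have to be settled first.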

User education and guidance is perhaps the most straightforward method of controlling data leakage. While it is limited by user compliance, proper guidance on the impact of sharing data with AI interfaces like ChatGPT can go a long way toward stemming inadvertent leaks via what looks like a harmless chat-bot. As with any form of user awareness training, results will vary from user to user, but the overall effect will be a lower incidence of data leakage over time, in the same way we have seen a reduction in user susceptibility to phishing emails.

The issues around data leakage via ChatGPT are in no way new to the world of private and confidential information. 

The ease of use, and the impression that questions are simply typed into a box (rather than being retained for the platform's continued data growth and analysis), make these generative AI platforms a bigger threat for data leakage than human-to-human verbal leaks have been in the past.

AI systems do not forget, and they talk to millions of people every month, not the dozens a human would in normal conversation. Careful and consistent user awareness training, combined with blocking known generative AI websites and service providers, is the best way to exert control against this type of leak, with new technologies likely to evolve over time to better target and control this user behavior.

Related Resources


Video

Demo of Data Exfiltration

The Data Exfiltration vector challenges your Data Loss Prevention controls, assessing your security before your sensitive information can be exposed.

Podcast

Cymulate BreachCast: Overlooking Data Exfiltration

Hear how Breach and Attack Simulation can be used to discover gaps in an enterprise organization.

Solution Brief

Data Exfiltration Assessment Vector

Read how Cymulate’s platform tests the effectiveness of your Data Loss Prevention (DLP) security controls and optimizes them.