In this article, we describe how AI Analyst interacts with enterprise data, which parts of the process touch that data, and where the data goes at every step of the analytical process.
Here is the BPMN diagram of the analytical process, starting from the user inquiry in a messenger (we use MS Teams as an example) and ending with the AI Analyst’s response in the messenger. The detailed algorithm behind AI Analyst is not the subject of this article and will be described in future articles. Here we focus only on data privacy.
The process can be divided into the following steps:
Inquiry
In the initial stage, the AI Analyst backend acquires the user’s message through an API linked to a messaging platform — by default, Microsoft Teams. It is standard for third-party messengers to retain a complete history of messages in their databases, implying that all information exchanged in the conversation is automatically archived outside the enterprise’s infrastructure.
To maintain data security, we distribute results and charts via links rather than embedding them directly in the response messages. This approach prevents sensitive data from leaving the secure environment, though the process can be adjusted as necessary.
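The link-based delivery described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `RESULT_LINKS` store, the `publish_result` and `build_reply` helpers, and the base URL are hypothetical and are not part of the Datrics API.

```python
import secrets

# Hypothetical in-memory store that maps opaque link tokens to result files.
# A real deployment would keep this mapping inside the secure environment.
RESULT_LINKS: dict[str, str] = {}

def publish_result(result_path: str,
                   base_url: str = "https://analyst.example.internal/results") -> str:
    """Register a result file and return an opaque link to it."""
    token = secrets.token_urlsafe(16)  # unguessable identifier
    RESULT_LINKS[token] = result_path
    return f"{base_url}/{token}"

def build_reply(summary: str, result_path: str) -> str:
    """Compose the messenger reply: a short summary plus a link, no raw data."""
    link = publish_result(result_path)
    return f"{summary}\nFull results: {link}"

reply = build_reply("Your EBIT% report is ready.",
                    "/data/results/ebit_2024.parquet")
```

Only the summary text and the opaque link reach the messenger history; the result file and its path stay on the server side.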
Classification
The classification phase is critical: here the AI Analyst identifies the datasets relevant to the user’s question. It uses a retrieval-augmented generation (RAG) approach, drawing on a database of scenarios to enrich the context of the user’s request. For example, if a query asks for EBIT% for selected months, the AI Analyst uses RAG to find the exact algorithm needed to compute EBIT.
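As a rough illustration of the retrieval step, the sketch below picks the best-matching scenario for a question. It is a toy under stated assumptions: the `SCENARIOS` entries are invented, and keyword overlap (Jaccard similarity) stands in for the embedding-based search a production RAG pipeline would more likely use.

```python
import re

# Hypothetical scenario database: name -> textual description of the algorithm.
SCENARIOS = {
    "ebit_margin": "Calculate EBIT% as EBIT divided by revenue for selected months",
    "churn_rate": "Calculate customer churn rate per quarter",
    "headcount": "Count employees by department",
}

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into word-like tokens (keeping '%')."""
    return set(re.findall(r"[a-z0-9%]+", text.lower()))

def retrieve_scenario(question: str) -> str:
    """Return the name of the scenario whose description best overlaps the question."""
    q_tokens = tokenize(question)

    def jaccard(item: tuple[str, str]) -> float:
        s_tokens = tokenize(item[1])
        return len(q_tokens & s_tokens) / len(q_tokens | s_tokens)

    name, _description = max(SCENARIOS.items(), key=jaccard)
    return name

best = retrieve_scenario("What was EBIT% for March and April?")
```

The retrieved scenario description is what gets added to the prompt as context; the user’s data plays no part in this lookup.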
This step is context enrichment: the AI Analyst compiles all the elements the scenario needs. The consolidated prompt, enriched with a detailed description of the available data, is sent to the Large Language Model (LLM). This description contains metadata rather than the data itself: column names, textual explanations, and (optionally) information about missing and categorical values.
Having identified the datasets and scenario required to answer the query, the AI Analyst then verifies the user’s access permissions to the data, which protects both the integrity and the security of the system.
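A metadata-only description of this kind could be built roughly as follows. This is a sketch, not the Datrics implementation: `describe_dataset`, its parameters, and the rule used to detect categorical columns are assumptions, and plain Python rows are used instead of a dataframe to keep the example self-contained.

```python
def describe_dataset(rows, include_values=False, max_categories=10):
    """Summarize columns without copying any row of data into the result.

    rows: list of dicts sharing the same keys; None marks a missing value.
    include_values: when True, list the distinct values of categorical columns
                    (the optional part of the description).
    """
    if not rows:
        return []
    description = []
    for name in rows[0]:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        distinct = set(present)
        # Assumed rule: few distinct string values means "categorical".
        is_categorical = (
            0 < len(distinct) <= max_categories
            and all(isinstance(v, str) for v in distinct)
        )
        entry = {
            "column": name,
            "has_missing": len(present) < len(values),
            "categorical": is_categorical,
        }
        if include_values and is_categorical:
            entry["categories"] = sorted(distinct)
        description.append(entry)
    return description

sample_rows = [
    {"month": "Jan", "revenue": 100.0, "region": "EU"},
    {"month": "Feb", "revenue": None, "region": "US"},
]
metadata = describe_dataset(sample_rows)
```

With the default `include_values=False`, the description carries column names and flags only, so no cell value from the dataset appears in the prompt.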
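The permission check can be pictured as a filter applied before any dataset is touched. The sketch below assumes a simple group-based access list; `DATASET_ACL`, the group names, and `split_by_access` are hypothetical and only illustrate the idea.

```python
# Hypothetical access-control list: dataset name -> groups allowed to read it.
DATASET_ACL = {
    "finance_pnl": {"finance", "executives"},
    "hr_headcount": {"hr"},
}

def split_by_access(user_groups, requested_datasets):
    """Return (allowed, denied) dataset names for a user with the given groups."""
    groups = set(user_groups)
    allowed, denied = [], []
    for dataset in requested_datasets:
        # Datasets missing from the ACL are denied by default.
        if DATASET_ACL.get(dataset, set()) & groups:
            allowed.append(dataset)
        else:
            denied.append(dataset)
    return allowed, denied

allowed, denied = split_by_access({"finance"}, ["finance_pnl", "hr_headcount"])
```

Denying unknown datasets by default keeps a misclassified request from reaching data the user was never granted.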
Considerations on Sharing Missing and Categorical Values with the LLM
Missing Values
LLMs use information about missing values to suggest the best data preprocessing steps. This might include filling in gaps, removing incomplete rows, or selecting algorithms that tolerate missing data. We ensure efficiency by tailoring our code to each dataset’s specific needs.
For data security, we might limit details to only naming the columns with missing values. Our Python code recommendations include a Datrics function to handle these missing values effectively.
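To make the options concrete, here is a small sketch of the restricted detail level (column names only) and of two common preprocessing choices. The helpers are illustrative stand-ins written for this example; they are not the Datrics function mentioned above.

```python
from statistics import median

def columns_with_missing(rows):
    """Names of columns that contain at least one missing (None) value."""
    return sorted({name for row in rows for name, value in row.items()
                   if value is None})

def fill_with_median(rows, column):
    """Return new rows where gaps in a numeric column are filled with its median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    return [{**row, column: fill_value if row[column] is None else row[column]}
            for row in rows]

def drop_incomplete(rows):
    """Return only the rows that have no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]

rows = [
    {"revenue": 1.0, "region": "EU"},
    {"revenue": None, "region": "US"},
    {"revenue": 3.0, "region": None},
]
names_only = columns_with_missing(rows)
```

Only `names_only` would need to be shared with the LLM at the restricted detail level; the filling or dropping itself runs locally on the data.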
Categorical Values
Knowing a dataset’s categories helps the LLM craft precise code. For instance, recognizing ‘Male’ and ‘Female’ categories in a column prompts gender-sensitive analysis. We select the right preprocessing, like one-hot encoding for unordered categories, ensuring accurate analysis results.
Sensitive details can be generalized to protect privacy, such as referring to ‘locations’ or ‘gender’ broadly. We also instruct the LLM to handle categories properly in feature engineering using Datrics functions*.
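One possible way to generalize sensitive categories is to report only how many distinct values a column has. The sketch below assumes a hand-maintained set of sensitive column names; `describe_categories` and its output fields are hypothetical.

```python
def describe_categories(column, values, sensitive_columns):
    """Describe a categorical column, withholding values for sensitive columns."""
    distinct = {v for v in values if v is not None}
    if column in sensitive_columns:
        # Generalized form: the LLM learns the column is categorical and how
        # many categories exist, but not what they are.
        return {"column": column, "kind": "categorical",
                "n_categories": len(distinct)}
    return {"column": column, "kind": "categorical",
            "categories": sorted(distinct)}

sensitive = {"gender", "location"}
gender_info = describe_categories("gender", ["Male", "Female", "Female"], sensitive)
plan_info = describe_categories("plan", ["basic", "pro", "basic"], sensitive)
```

The generalized description still tells the LLM that the column needs categorical handling, for example one-hot encoding, without revealing the category labels.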
Although missing and categorical values are sensitive, omitting them from LLM prompts may reduce code quality. Taken out of context, however, this information alone does not compromise data security.
*Datrics functions are a set of code snippets that perform high-level data analytics operations and can, in some cases, replace Python code generation to speed up the answer.
Execution
The execution phase rests on two key processes: generating the code and executing it.
From a data privacy standpoint, the AI analyst does not transmit the actual data to the LLM. Instead, it sends the scenario outlined in the previous phase along with a dataset description — this forms the basis for the scenario generation.
The LLM develops Python code, which Datrics then executes within an isolated sandbox environment: a standalone Docker container equipped with secure access rights. Once the output is deemed satisfactory, it is stored in the local file system or block storage and subsequently made accessible to the user via a link.
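A sandboxed run of generated code might look roughly like the sketch below, which assembles a `docker run` command with networking disabled, a memory cap, and read-only inputs. The image name, mount paths, and limits are placeholder assumptions; the article does not describe the actual container configuration.

```python
import subprocess

def sandbox_command(script_path, data_dir, output_dir,
                    image="python:3.11-slim", memory="2g"):
    """Build a docker command that runs a generated script in isolation."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no outbound network from generated code
        "--memory", memory,         # hard RAM limit for the container
        "--read-only",              # root filesystem is read-only
        "--tmpfs", "/tmp",          # scratch space that vanishes with the container
        "-v", f"{script_path}:/job/script.py:ro",  # generated code, read-only
        "-v", f"{data_dir}:/job/data:ro",          # input data, read-only
        "-v", f"{output_dir}:/job/output",         # the only writable mount
        image, "python", "/job/script.py",
    ]

def run_in_sandbox(command, timeout_s=300):
    """Execute the command and capture its output; requires a Docker daemon."""
    return subprocess.run(command, capture_output=True, text=True,
                          timeout=timeout_s)

cmd = sandbox_command("/srv/jobs/42/script.py", "/srv/data/finance",
                      "/srv/jobs/42/output")
```

Calling `run_in_sandbox(cmd)` would start the container; results written to the output mount can then be stored and exposed to the user through a link.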
To ensure robust code execution, we use a Kubernetes cluster utilizing a producer/consumer architecture based on Celery and Redis. This setup can dynamically scale resources as needed. RAM requirements for processing are estimated using heuristics derived from our empirical data on the overhead of libraries like Pandas and PyArrow.
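A RAM heuristic of this kind could be as simple as multiplying the file size by a per-format expansion factor and routing the job to a worker of sufficient size. The factors and worker tiers below are invented placeholders, since the actual heuristics are not published in this article.

```python
# Illustrative multipliers only: how much larger a file may become in memory.
EXPANSION_FACTOR = {"csv": 5.0, "parquet": 10.0}

# Hypothetical worker sizes (in MB) that jobs can be routed to.
WORKER_TIERS_MB = [2048, 8192, 32768]

def estimate_ram_mb(file_size_mb: float, file_format: str) -> float:
    """Estimate peak memory needed to process a file of the given format."""
    return file_size_mb * EXPANSION_FACTOR[file_format]

def pick_worker_tier(required_mb: float) -> int:
    """Return the smallest worker tier that fits the estimated requirement."""
    for tier in WORKER_TIERS_MB:
        if required_mb <= tier:
            return tier
    raise MemoryError(f"No worker tier can hold {required_mb:.0f} MB")

small_job = pick_worker_tier(estimate_ram_mb(200, "csv"))
large_job = pick_worker_tier(estimate_ram_mb(1000, "parquet"))
```

In a producer/consumer setup, the selected tier would typically decide which task queue the job is published to, so that a consumer with enough memory picks it up.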
Our current system design is modular, allowing for potential upgrades to support out-of-core (larger-than-memory) computation with frameworks such as Apache Spark, or newer, more efficient frameworks like Polars that have lower memory and computational overhead.
Ending
The final output from the execution phase is securely saved to the server file system and is made accessible to the authorized user via a link. Should the user wish to distribute the results to colleagues, they have the option to configure the link for broader accessibility, either publicly or within a specific domain.
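The sharing options can be modeled as three link scopes. The sketch below is an assumption about how such a check might be structured; `LinkScope`, `ResultLink`, and `can_open` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class LinkScope(Enum):
    PRIVATE = "private"  # only the user who requested the analysis
    DOMAIN = "domain"    # anyone with an email in a specific domain
    PUBLIC = "public"    # anyone who has the link

@dataclass
class ResultLink:
    owner_email: str
    scope: LinkScope = LinkScope.PRIVATE
    allowed_domain: Optional[str] = None

def can_open(link: ResultLink, user_email: Optional[str]) -> bool:
    """Decide whether a (possibly anonymous) user may open the result link."""
    if link.scope is LinkScope.PUBLIC:
        return True
    if user_email is None:
        return False
    email = user_email.lower()
    if email == link.owner_email.lower():
        return True
    if link.scope is LinkScope.DOMAIN and link.allowed_domain is not None:
        return email.endswith("@" + link.allowed_domain.lower())
    return False

link = ResultLink(owner_email="ann@corp.example")
shared = ResultLink(owner_email="ann@corp.example", scope=LinkScope.DOMAIN,
                    allowed_domain="corp.example")
```

Links default to the private scope, so broader access only happens when the owner explicitly widens it.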
Conclusion
The outlined methodology provides a secure framework for leveraging Large Language Models (LLMs) with sensitive enterprise data. While it’s possible to integrate additional safeguards — such as output guardrails — the existing system already ensures the safe employment of LLMs for handling sensitive information.
As a further recommendation, Datrics offers robust support for OpenAI’s ChatGPT Enterprise within the Azure Cloud, which includes comprehensive safety guarantees as part of its service level agreement (SLA), making it a reliable choice for enterprises.
For enhanced security, albeit at increased cost, Datrics plans to offer an open-source version of the classification model, such as a fine-tuned LLaMA 13B. It can be deployed directly within an enterprise’s infrastructure, providing not only complete control over data privacy but also improved processing speed.