Generative AI (GenAI) is revolutionizing cybersecurity, promising faster threat detection and deeper analysis. Realizing that promise, however, hinges on the quality and reliability of AI-generated outputs. How do we know Precision AI is giving us the right answers? How do we trust it to protect our networks?
At Palo Alto Networks, we combine traditional machine learning with cutting-edge generative artificial intelligence to ensure that our AI solutions are robust and reliable. A cornerstone of our approach is rigorous evaluation.
The Challenge: Evaluating Natural Language Outputs in Cybersecurity
Traditional machine learning models produce structured outputs that are well-suited to established evaluation metrics. Generative AI, however, often produces unstructured, free-form text, which is much harder to evaluate with quantitative metrics. For modern systems, we need modern techniques that measure whether free-form text is accurate and trustworthy.
For example, every day Palo Alto Networks sees a massive number of suspicious executables crossing firewalls, running on endpoint devices, and submitted directly by customers – and we need to explain to security professionals what the dangers and risks are. Each executable must be assessed quickly to stop potential security breaches, but we cannot rely on AI-written explanations without being confident that they are trustworthy: correct, easy to understand, and relevant.
The same issue arises when processing documents and emails to protect against sensitive data loss – how do you explain to a user what is happening and be sure the text-based answers are correct?
Our Solution: A Multi-Faceted Approach
We use a variety of evaluation techniques to help ensure that Precision AI by Palo Alto Networks is performing at its best, including BERTScore, LLM-as-a-judge, and SIDE.
When we already have a substantial dataset linking questions with known good answers, BERTScore often works well. For instance, if we want to evaluate AI-written explanations of potential malware, we might rely on an evaluation dataset of thousands of PowerShell scripts, each paired with a correct human-expert-written explanation of the script’s behavior.
Introduced in 2020 and consistently correlating well with human judgment, BERTScore produces a number between 0 and 1 that reflects how similar the AI-written text is to the human-written reference. To generate the score, BERTScore compares contextual embeddings (mathematical representations of each word's meaning in context) across the two texts, aggregating the token-level similarities into a single number.
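To make this concrete, here is a minimal sketch of computing BERTScore with the open-source bert-score Python package. The candidate and reference explanations are hypothetical placeholders, not real evaluation data.

```python
# A minimal sketch: score an AI-written explanation against a
# human-expert reference using the open-source `bert-score` package.
from bert_score import score

ai_explanations = [
    "The script downloads a remote payload and schedules it to run at startup.",
]
expert_references = [
    "This PowerShell script fetches an executable from a remote host and "
    "creates a scheduled task so the payload persists across reboots.",
]

# P, R, and F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(ai_explanations, expert_references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")  # closer to 1 means more similar
```

In practice, a score like this would be computed over the entire evaluation dataset, with the aggregate tracked as the underlying model changes.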
However, we often do not have the kind of dataset BERTScore requires. Building such datasets through human labor is expensive, and people tend to struggle with high-volume, repetitive, detail-heavy tasks like writing explanatory texts. Thankfully, we have other options.
One GenAI-based option is to use LLMs as judges to evaluate responses. Excitement around LLM-as-a-judge has skyrocketed since 2023, when Zheng et al. concluded that strong LLM judges like GPT-4 reach 80% agreement with humans, on par with the rate at which human labelers agree with one another on subjective tasks like text analysis. Only half a year later, Sun et al. concluded that GPT-4’s assessment of code explanation quality is highly correlated with human assessment.
At Palo Alto Networks, we agree that LLMs can be highly effective judges for evaluating code explanations and other free-form text outputs. We do caution practitioners to use a strong LLM, frame the task carefully, and evaluate thoroughly to ensure the judge is in fact aligned with human preferences.
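As an illustration only, here is a minimal LLM-as-a-judge sketch using the OpenAI Python client. The judge model, rubric, and 1-5 scale are assumptions made for this example, not a description of our production pipeline.

```python
# A minimal LLM-as-a-judge sketch. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; the model and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an explanation of a code snippet.

Code:
{code}

Explanation:
{explanation}

Rate the explanation from 1 (wrong or misleading) to 5 (correct, clear,
and relevant). Reply with only the number."""

def judge_explanation(code: str, explanation: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # any strong judge model
        temperature=0,   # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(code=code, explanation=explanation),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

Framing the task with a clear rubric, and spot-checking the judge's scores against human raters, is what keeps a setup like this aligned with human preferences.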
We also note that LLMs are among the slowest and most expensive AI models to run, that they generate the most likely text sequence rather than a verifiably optimal answer, and that, as a field, we still lack techniques for understanding why a specific large language model made the decision it did.
An alternative to LLM-as-a-judge is a SIDE-style (summary alignment to code semantics) model. SIDE-as-a-judge is fast, explainable, and often just as effective as a human or an LLM in its judgments. SIDE uses a semi-supervised technique known as contrastive learning to estimate the quality of a code explanation paired with some code. In contrastive learning, we give the model examples of pairs that match and pairs that do not, and the model learns to tell the two apart.
In the code explanation problem, if the code explanation is appropriate for the code, SIDE assigns a high score; if the code explanation is inappropriate, SIDE assigns a low score. Although SIDE was initially applied to code, it can also be applied to other domains.
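To illustrate the idea, here is a minimal sketch in the spirit of SIDE, using the sentence-transformers library: embed the code and its explanation with a bi-encoder and treat their cosine similarity as the quality score. The model name and examples are illustrative assumptions; SIDE itself relies on an encoder fine-tuned contrastively on code paired with good summaries, using mismatched summaries as negatives.

```python
# A minimal SIDE-style scoring sketch. In practice the encoder would be
# fine-tuned contrastively on (code, good summary) pairs; here we use an
# off-the-shelf model purely as a stand-in.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # stand-in for a tuned encoder

code = "for ($i = 0; $i -lt 10; $i++) { Write-Output $i }"
good_summary = "Prints the numbers 0 through 9."
bad_summary = "Deletes every file in the current directory."

code_emb = model.encode(code, convert_to_tensor=True)
for summary in (good_summary, bad_summary):
    summary_emb = model.encode(summary, convert_to_tensor=True)
    sim = util.cos_sim(code_emb, summary_emb).item()
    print(f"{sim:.3f}  {summary}")  # higher means better aligned with the code
```

Because scoring is just two embedding passes and a similarity computation, this kind of judge is dramatically cheaper to run at scale than querying an LLM for every evaluation.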
The SIDE technique often captures human quality assessments extremely well, though an organization must bear the cost of training and evaluating a custom model. We note, however, that because it is a traditional ML model, SIDE struggles to extrapolate beyond its training data. As long as that training data is truly representative of the kinds of data that will be evaluated, SIDE can enable faster and cheaper evaluation than LLM-as-a-judge. With faster evaluation, we can improve the underlying AI more efficiently and, in turn, improve security outcomes.
Each evaluation technique has distinct strengths. We select among them with a purpose in mind, choosing the technique best suited to ensuring high-quality outputs for the problem we are solving.
The Palo Alto Networks Difference: Thoughtful, Thorough, Capable, and Correct
At Palo Alto Networks, we're committed to maintaining our top-tier Precision AI, and our rigorous evaluation process demonstrates the depth of expertise we bring to ensuring quality. This dedication to cutting-edge evaluation methods enables our AI solutions to meet the highest standards of accuracy and reliability.
By committing to rigorous testing and evaluation, Palo Alto Networks is leading the way in accurate, trustworthy AI-powered cybersecurity. We believe this multi-faceted approach is essential for realizing the full potential of AI’s role in protecting your organization.
Stay Informed. Stay Safe.
Explore Precision AI®: Defend against adversarial AI in real time, secure your organization's GenAI usage and AI-powered application development, and cut down on complexity with AI-driven security copilots.
Learn about our AI-powered threat intelligence: Stay informed about the latest threats and vulnerabilities with our AI-driven threat intelligence service.
Engage with our demos: See our AI cybersecurity solutions in action and learn how they can protect your organization.