A real-world case of a questionable test

Some time ago, we published a post explaining how businesses should navigate the vast sea of benchmark tests and testing organizations. The overarching idea boiled down to the following: For a test to be trustworthy, its methodology and published results must be transparent. That means all participants should be aware of testing conditions and assured there are no mistakes in evaluation. Also, results must be verifiable and reproducible.

Those conditions might seem obvious — not to mention, already satisfied. After all, every tester publishes a “Methodology” document in its reports or on its website, describing the testing scenarios in detail, sometimes down to the selection of malware used in the test run.

In fact, however, that is not enough. A result such as “The product detected 15 of 20 new malware samples” has no practical value on its own. What malware was used? Did it pose any real threat? Were the samples meaningfully different from one another? Did the tester double-check the results, and if so, how?

Ultimately, there’s a lot of room for ambiguity. One example that demonstrates seemingly straightforward results with a lot of room for error is a recent benchmark test, run by NSS Labs, of “advanced endpoint protection” products.

What happened?

We take third-party benchmark testing very seriously. One of the best reasons to support a wide field of testers is that different testers use different security testing methodologies, and we need to know how our products perform under all of them. Having obtained test results, we need to see what happened at each stage of the benchmarking process. That enables us to identify bugs in our products (they do sometimes exist, unfortunately) and, possibly, the tester’s mistakes. The latter could be the result of outdated databases, a bad connection to the vendor’s cloud services, a sample malfunction, or misinterpretation of test results.

This is why we ask researchers to enable tracing in the product, and usually that request is granted. Moreover, in this case we were allowed to access the test stand remotely and apply the necessary settings to the product.

Naturally, as soon as we saw the NSS Labs test results, we decided to investigate. We found that some of the malware samples our product allegedly missed are in fact detected by both our static technologies (which do not require the sample to be executed) and our behavioral ones. Moreover, these files are caught by signatures that have been in our databases for quite a long time (some since 2014). That seemed odd.

Then we found out we could not study the logs: tracing had been disabled. That alone is enough to call the test results into question. We kept looking and found more.

Which threats were used in the test?

To reproduce an experiment and understand why a solution did not deliver, a vendor needs to see all of the relevant details. To make that possible, testers usually upload the malware samples, provide a capture of the network traffic, and, if the attack technique is widely known, explain how it can be reproduced with the aid of well-known toolkits (such as Metasploit, Canvas, and the like).

Well, NSS Labs refused to provide some of the malware samples, calling them someone else’s intellectual property. They did eventually provide the CVE ID of the vulnerability exploited in the attack our product allegedly failed to stop. However, the vulnerability was not covered by known, publicly available kits, so we were unable to reproduce the attack, let alone verify it.
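For illustration only, here is a minimal sketch of the kind of reproducibility check we mean, assuming a local Metasploit installation; the CVE identifier below is a placeholder, not the one disclosed to us. If the search returns no modules, the attack cannot be reproduced with that kit alone.

    import subprocess

    # Hypothetical check: ask a local Metasploit installation whether any of its
    # modules reference a given CVE. The identifier below is a placeholder.
    cve_id = "2017-0000"  # year-number form used by Metasploit's "cve:" search keyword

    # msfconsole -q starts without the banner; -x runs the given console
    # commands ("search", then "exit") and returns.
    subprocess.run(
        ["msfconsole", "-q", "-x", f"search cve:{cve_id}; exit"],
        check=True,
    )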

In no way are we trying to skirt the rules protecting intellectual property. However, if protecting intellectual property might compromise the transparency of the test, that should have been discussed with participating vendors in advance. No one would have been hurt if vendors had studied the technique of the attack under a nondisclosure agreement.

Of course, the more sophisticated and unique the attack scenarios used, the more valuable the test. But of what value are the results of an attack that is unverifiable and unreproducible? Our industry’s ultimate goal is protecting end users. So by all means find imperfections in a product and cite them in the results, but also give vendors what they need to protect users better.

In a nutshell, we were unable to obtain prompt proof for the majority of the flaws allegedly found in our product.

What files were used in the benchmark tests?

As a rule, in a benchmark test, security products are challenged to respond to malicious files. If a product detects and blocks them, it gets a point. Tests also include legitimate files to check the product for false positives; if a product lets a clean file run, it gets another point. That seems straightforward.

But the NSS Labs test set included the original PsExec utility from the PsTools (Sysinternals) suite by Mark Russinovich. Although the utility carries a valid Microsoft digital signature, for the purposes of this test NSS Labs decided the products should deem it malicious. Yet many system administrators use PsExec to automate their work. By that logic, we would also have to judge cmd.exe and a whole range of similar legitimate programs malicious.
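To illustrate that legitimate use, here is a minimal sketch (the host names and the maintenance command are hypothetical) of an administrator driving PsExec from a script to run the same task across a fleet of workstations:

    import subprocess

    # Hypothetical example of routine administration with PsExec: run the same
    # maintenance command on a list of workstations. Host names are placeholders.
    hosts = ["WS-001", "WS-002", "WS-003"]

    for host in hosts:
        # \\HOST targets the remote machine; -s runs the command as SYSTEM.
        subprocess.run(
            ["psexec", f"\\\\{host}", "-s", "ipconfig", "/flushdns"],
            check=True,  # raise if the remote command reports an error
        )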

Of course, it’s undeniable that such a tool can be used with malicious intent. But for this test scenario to be appropriate, the entire attack kill chain would have to be demonstrated, and a successful attack would then rightly count as a failure of the product. Simply expecting a security solution to flag the file itself as malicious detracts from the value of the test results.

Also, vendors and testers hold differing views on “potentially malicious” software. We think such programs should not be treated as outright malicious: for some users they are legitimate tools, even if others consider them risky. Using such samples in a benchmark without qualification is at best inappropriate.

Which version of the product was tested?

While we were studying the results, we found out that in the majority of testing scenarios, our products’ databases had not been updated for more than a month — a fact somehow omitted from the final report. The researchers said that was fine: some users don’t keep their installed products updated, after all.

No competitive security solution relies solely on signature databases these days, so naturally, benchmark tests also exercise heuristics and behavioral analysis. But according to the testing methodology in this case, the purpose was to test straightforward blocking and detection — not heuristics alone.

In any case, all test participants should be on the same page in terms of updates. Otherwise, how could the test be comparative?

The interaction

To eliminate potential testing mistakes, it’s essential for all parties to have access to the most detailed information. In this respect, the interaction with the researchers was rather strange, and it hampered our attempts to understand both the testing process and the resulting documentation. The timing created further issues: with deadlines shifting in the run-up to RSA Conference, thorough analysis, Q&A, and reproduction were impossible within the conference timeline. Finally, the resource from which vendors were to obtain samples was problematic — some files were missing, others were continuously replaced, and some did not match the table of results.

Considering all of the above, can the results really be called fair and transparent?

We are now negotiating with the testing lab, and the tester has already accepted some of our claims. However, we have yet to reach consensus on many other issues. For example, the tester cited one malware sample that detected a working security product on the system and aborted its malicious behavior. Because the sample did nothing malicious, it was not detected, and the tester called that a failure of the product. But the user was protected from the threat thanks to the security product; we call that a success.

Summing up all of the above, we have come up with a list of suggested requirements for benchmark tests. We think tester compliance is critical to ensure research transparency and impartiality. Here are the requirements:

  • A tester must present logs, a capture of the network traffic, and proof of the product’s success or failure;
  • A tester must provide files or reproducible test cases, at least for those the product allegedly failed to detect;
  • Solid proof is required that an attack simulated during the test and undetected by a product indeed inflicted harm on the test system. Potentially malicious software should be considered part of a particular test case, with nondetection counted as a failure only upon proof that the sample inflicts real damage;
  • Clean files may and should be used in a test to check for false positives, but they cannot be treated as threats based on potential for misuse. (Also, a modified clean sample should not be considered a threat — whatever modifications are used, the files continue to be “clean”);
  • The materials serving as proof of results must be provided to participants on a fair and equal basis.