The Problem With Current Security Assessments
On one end: highly formal assurance
Common Criteria:
Extremely expensive: about $1M for initial assessment
Meaningless answer:
3 bits: EAL0-7
A "high assurance" OS can be rooted the next day by a buffer overflow
So how much of this is "enough"?
On the other end: Bugtraq Whack-a-mole
Chronic chain of "gotcha" vulnerability disclosures
Each disclosure tells you that you are not secure, but never tells you when you are secure
Not very helpful :)

Assessing the Assurance of Retrofit Security
Commodity systems (UNIX, Linux, Windows) are all highly vulnerable
Have to retrofit them to enhance security
But there are lots of retrofit solutions
Are any of them effective?
Which one is best?
For my situation?
Instead of "How much security is enough for this purpose?", we get "Among the systems I can actually deploy, which is most secure?"
Consumer says "We are only considering solutions on FooOS and BarOS"
Relative figure of merit helps customer make informed, realistic choice

Proposed Benchmark: Relative Vulnerability
Compare a "base" system against a system protected with retrofits
E.g. Red Hat enhanced with Immunix, SELinux, etc.
Windows enhanced with Entercept, Okena, etc.
Count the number of known vulnerabilities stopped by the technology
"Relative Invulnerability": % of vulnerabilities stopped
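A minimal sketch of how the tally could work, in Python; the vulnerability IDs and the stopped/not-stopped outcomes below are purely hypothetical:

    # Tally Relative Invulnerability: % of known base-system vulnerabilities
    # that the retrofit technology stopped. All data here is made up.
    results = [
        ("vuln-01", True),   # exploit attempted against protected system, stopped
        ("vuln-02", False),  # exploit succeeded despite the retrofit
        ("vuln-03", True),
    ]

    stopped = sum(1 for _, was_stopped in results if was_stopped)
    relative_invulnerability = 100.0 * stopped / len(results)
    print(f"Relative Invulnerability: {relative_invulnerability:.0f}% "
          f"({stopped}/{len(results)} vulnerabilities stopped)")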

Can You Test Security?
Traditionally: no
Trying to test the negative proposition that "this software won't do anything funny under arbitrary input", i.e. no surprising "something elses"
Relative Vulnerability transforms this into a positive proposition:
Candidate security-enhancing software stops at least foo% of unanticipated vulnerabilities over time
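As a sketch, the positive proposition becomes a concrete, checkable claim over an observation window; the threshold and records here are hypothetical:

    # Hedged sketch: the positive proposition as a checkable claim.
    # "records" lists vulnerabilities disclosed during an observation window and
    # whether the candidate technology stopped each one; all values are made up.
    records = [("vuln-01", True), ("vuln-02", True), ("vuln-03", False)]
    threshold_pct = 50.0  # the "foo%" in the claim; hypothetical

    stopped_pct = 100.0 * sum(ok for _, ok in records) / len(records)
    claim_holds = stopped_pct >= threshold_pct
    print(f"stopped {stopped_pct:.0f}% of disclosed vulnerabilities (claim holds: {claim_holds})")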

Vulnerability Categories
Local/remote: whether the attacker can attack from the network, or has to have a login shell first
Impact: using classic integrity/privacy/availability
Penetration: raise privilege, or obtain a shell from the network
Disclosure: reveal information that should not be revealed
DoS: degrade or destroy service
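A hedged sketch of this taxonomy as a simple data structure (Python; the names are illustrative, not part of the proposal):

    from dataclasses import dataclass
    from enum import Enum

    class Vector(Enum):
        LOCAL = "local"    # attacker needs a login shell first
        REMOTE = "remote"  # attacker can attack from the network

    class Impact(Enum):
        PENETRATION = "penetration"  # raise privilege, or obtain a shell from the network
        DISCLOSURE = "disclosure"    # reveal information that should not be revealed
        DOS = "dos"                  # degrade or destroy service

    @dataclass
    class Vulnerability:
        ident: str       # advisory identifier
        vector: Vector
        impact: Impact

    example = Vulnerability("vuln-01", Vector.REMOTE, Impact.PENETRATION)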

Impact
Lower barriers to entry
Anyone can play -> more systems certified
Real-valued result
Instead of boolean certified/not-certified
Easy to interpret
Can partially or totally order systems
Empirical measurement
Measure results instead of adherence to process
Implementation measurement
CC can't measure most of the Immunix defenses (StackGuard, FormatGuard, RaceGuard)
RV can measure their efficacy

Issues
Does not measure vulnerabilities introduced by the enhancing technology
Actually happened to Sun/Cobalt when they applied StackGuard poorly
Counting vulnerabilities:
When l33t d00d reports "th1s proggie has zilli0ns of bugs" and supplies a patch, is that one vulnerability, or many?
Dependence on exploits
Many vulnerabilities are revealed without exploits
Should the RV test lab create exploits?
Should the RV test lab fix broken exploits?
Probably yes
Exploit success criteria
Depends on the test model
Defcon "capture the flag" would not regard Slammer as a successful exploit because its payload was not very malicious
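One way to make "depends on the test model" concrete: treat exploit success as a predicate over observed outcomes, chosen per test model. This is only a sketch; the outcome fields and model names are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Outcome:
        got_shell: bool         # attacker raised privilege or obtained a shell
        data_disclosed: bool    # information that should not be revealed was revealed
        service_degraded: bool  # service was degraded or destroyed

    def exploit_succeeded(outcome: Outcome, model: str) -> bool:
        if model == "capture-the-flag":
            # Only control of the box counts; a Slammer-style worm that merely
            # degrades service would not score as a success under this model.
            return outcome.got_shell
        # An "any harm" model also counts disclosure and denial of service.
        return outcome.got_shell or outcome.data_disclosed or outcome.service_degraded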

Work-factor View
Assume that well-funded attacker can penetrate almost any system eventually
The question is "How long can these defensive measures resist?"
RV may probabilistically approximate the work factor to crack a system
foo% of the base system's native vulnerabilities are not actually exploitable on the protected system
Therefore foo% of the time a well-funded attacker can't get in that way
Attacker takes foo% longer to get in???
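One hedged way to sharpen that last bullet: assume the attacker draws known exploits independently at random and a fraction p of them are stopped by the retrofit. The number of attempts until one succeeds is then geometric, so the expected work scales as

    E[attempts] = 1 / (1 - p)

i.e. blocking a fraction p of the exploit paths multiplies the expected work by 1/(1-p), which grows much faster than "p% longer" as p approaches 1. Whether real attackers behave like this independent-trials model is an open question.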

Lessons Learned the Hard Way
Security advisories lie
often incomplete, or wrong
Published exploits are mostly broken, deliberately
Compiled-in intrusion prevention like StackGuard makes it expensive to determine whether the defense really stopped an exploit, or whether the exploit just failed due to an incompatibility
Also true of diversity defenses

Technology Transfer
ICSA Labs
traditionally certify security products (firewalls, AV, IDS, etc.)
no history of certifying secure operating systems
interested in RV for evaluating OS security
ICSA issues
ICSA needs a pass/fail criterion
ICSA will not create exploits