Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Intrusion Tolerant Systems

2000, International Symposium on Wikis

Intrusion Tolerant Systems ✁ Partha P. Pal, Franklin Webber, Richard E. Schantz and Joseph P. Loyall BBN Technologies 10 Moulton Street Cambridge, MA 02138 ppal, fwebber, rschantz, jloyall @bbn.com ✂ 1. Background and in order to succeed they have to overcome all of them. Economic pressure dictates the use of COTS components rather than specialized ones. Component developers are under pressure to market their products quickly and hence often release products with errors and vulnerabilities waiting to be discovered. Interoperability requirements dictate that such systems be more open than closed. This openness and the distributed nature lead to more access points that an attacker could target. Finally, understanding and managing a system always lag its development and deployment: new attacks are always discovered after deployment. For these reasons, even the defense in depth approach cannot guarantee that critical systems will be completely shielded from the attackers. Some attacks will be able to compromise the barriers and alter the availability and quality of systems resources, and thereby affect the application. Because of these reasons, we want to achieve intrusion tolerance by enabling the applications survive the effects of intrusions. The rest of the paper is organized as follows. In Section 2 we first present our position for this workshop, followed by some elaboration of key ideas. Section 3 describes our recent work in this and related areas. Section 4 concludes the paper. Many intrusions on computer systems often target specific applications. By exploiting vulnerabilities in the infrastructure or in an application, intruders try to stop the application altogether or to make it behave in an abnormal way. In either case, their objective is to stop the application from being useful. Considered abstractly, there is a competition between the application (trying to deliver useful service) and the attacker (trying to prevent the application from doing so) over resources at various levels. In order to perform, the application needs resources ranging from CPU time, memory and network bandwidth to higher level ones like data and objects. When an intruder gains enough control over one or more resources, the attack is successful. For instance, a DOS (denial of service) attack on an application such as a web-server results when an attacker manages to deprive the application of the required CPU or network resources. The approach taken by traditional security engineering attempts to protect the infrastructure resources from intruders by establishing a preventive barrier. A barrier acts like a catch all guard protecting all resource consumers upstream without much interaction or cooperation between the protectorate and protectors. For instance, firewalls, a network layer barrier, do not normally interact with applications that use the network. Similarly, IDSs (Intrusion Detection Systems) operating at various system layers hardly ever cooperate among themselves or interact with other kinds of applications. While the world is gravitating towards more use of COTS components and more integration of diverse and distributed resources, security mechanisms attempting to prevent attacks are bound to be imperfect. This is evidenced in the recent “defense in depth” approach which calls for layering of defensive mechanisms of various capabilities so that an attacker faces multiple barriers 2. Our Position: Support for Intrusionaware Survivable Applications We argue that development and support of intrusionaware survivable applications, i.e., applications that react to intrusions and survive their consequences, are key problems in the area of intrusion tolerant systems. Even though the idea of intrusion-aware, survivable applications seems like a natural part of the defense in depth model, there is no easy and systematic way to support such applications in today’s distributed systems infrastructure. We introduce the required support in the middleware that mediates between the application and the infrastructure. ✄ This work is sponsored by DARPA under contracts No. F3060200-C-0172 and No. F30602-99-C-0188 1 we focus on how such an attack may manifest itself in different system layers. At the application level, we may see one or more of the following symptoms: A survivable application must incorporate a “survivability strategy”–a specification distinct from its functional requirements, for what to do when intrusions happen. Usually, these strategies involve adaptation and awareness of the environment and system resources. Our proposed middleware capabilities aim to make implementation and experimentation with survivability strategies more systematic, realistic and cost effective. Key aspects of our approach are outlined below. one or more requests timing out or throwing exception, even when retried multiple times. ☎ one or more objects crashing, perhaps repeatedly on restarts. ☎ Focus on symptoms: If an application can survive the effects caused by an intrusion, it has effectively tolerated the intrusion. Our objective is to help the application in doing so. Consequently, we focus on addressing the symptoms caused by attacks, rather than preventing or detecting, diagnosing, containing or irradicating an attack. One could argue that we are recovering from an attack in an indirect way. ☎ one or more requests blocked indefinitely. ☎ At the network level, we may see one or more of the following symptoms: abnormal traffic volume in a network segment. ☎ unexpected content in network traffic. ☎ overload/crash of network devices such as routers. ☎ At the operating system level we may see: Adaptive and unpredictable response: The ability to adapt to changing environmental and operational conditions is a key for surviving the symptoms of intrusions. However, in the context of pre-planned and coordinated attacks, adaptive responses that are predictable can be easily compromised. Therefore, adaptive responses must be unpredictable to the attacker. ☎ ☎ presence of unusual files (programs, scripts). ☎ presence of unusual processes and CPU load. ☎ ☎ unusual usage pattern of network interface and/or system calls Some of these symptoms can result from natural causes however, in the context of intrusions, symptom occurrences do not follow any natural distribution often associated with “normal” faults. Symptoms may appear simultaneously, affecting multiple components of the system. They may also appear in stages, plaguing different parts of the system one after another. These are some of the factors that make coping up with intrusion symptoms harder. We are trying to address it by enhancing fault-tolerance techniques to deal with these issues and applying the enhanced techniques to different kinds of system resources. Approaching intrusion-tolerance from the symptoms has both pros and cons. Intrusions can be very versatile. The same attack may manifest itself in different system layers in different forms under different environmental conditions. New attacks are always being devised. Trying to accurately specify all of the attacks or attack classes to tolerate can be a difficult task. Devising strategies in terms of symptoms helps us manage this complexity. A finite number of symptoms can (at least partially, if not fully) cover multiple known and unknown attacks because many attacks produce the same symptoms (such as object crash or network overload). On the other hand, not all symptoms are equally bad for every application and some may even be benign. An observed symptom may or may not be a positive indication of an intrusion either. However, if the application is equipped with a strategy to deal with most of Bridging the gap between the application and the infrastructure: It is clear that the disconnect between an application and the infrastructure prevents the application from being easily aware of and rapidly responsive to changes in the availability and quality of resources. Advanced middleware such as QuO [6], which is an adaptive software layer operating on a distributed object base, helps bridge the gap. We utilize the QuO middleware to coordinate the capabilities required for supporting intrusion aware survivable applications. The next three sections provide more details of our approach and identify several important issues that we are currently investigating. 2.1. Addressing the Symptoms In order to devise an effective survivability strategy for an application, one has to first think about the application’s survivability requirements: what kinds of intrusion are considered and what should be done to cope with them if they are even partially successful. We propose to formalize these in terms of symptoms (failures, if we follow fault-tolerance terminology) caused by the attacks as opposed to the attacks themselves. For example, instead of framing a strategy for “syn-flood” attacks, 2 these symptoms, we believe we can achieve a significant amount of intrusion tolerance. One underlying problem that complicates the task here is that responding to every observed symptom may trigger unnecessary adaptation. If we are not careful, this may launch a self-denial of service. Several possibilities are under investigation. Symptoms may be categorized based on severity and the application may respond to the most severe ones and ignore the less severe ones. Sometimes, the severity of a symptom may change depending on external events. For instance, an application may choose to respond a network layer symptom only when external IDSs suspect an ongoing network attack. Finally, formulating intrusion tolerance in terms of symptoms is not unprecedented. The kind of intrusions addressed by recent work on wrappers at NAI[1] and the work on protecting the win32 dlls at RST [4] can be viewed to be categorized in terms of symptoms, for instance, all attacks that attempt to write to the start-up folder. vivability strategy. Symptoms dictate the nature and form of the adaptation. For instance, repeated replica crash on a single host, a symptom observed at the application level, could be responded to by migrating the replica from the host in question to a different one. In the case of a pre-planned and coordinated attack, adaptive response by itself is not enough if the response can be predicted. If the attacker can accurately predict which host the replica will be migrated to, he can plan ahead to re-target the migrated replica in its new location. Therefore, adaptive responses should be unpredictable to the attacker. This way, pre-planning will be harder, and coordinating an ongoing attack to counter the adaptive responses will take longer and be less reliable. This in turn, will improve the application’s survivability. Incorporating unpredictability is also non-trivial. First, there has to be multiple options to choose from. Then, response selection mechanism should have the capability to select different response(s) to the same symptom at different times. Finally, not all of the responses may be equally effective. In particular, there may be one or more of the available options that must be engaged. The response selection mechanism must be aware of such constraints as well. 2.2. Survivability Strategies Using Adaptive and Unpredictable Responses Once we identify the symptoms to monitor for surviving, we devise the strategic responses to the symptoms. From one perspective, a strategy could be proactive or reactive. A pro-active strategy dictates a course of action (COA) in anticipation or in preparation for a future attack. For instance, anticipating simultaneous attacks on multiple replicas, one may deploy a nonreplicated stand-by which can be put in service at short notice. As an example of a pro-active strategy that has the “in preparation” flavor, consider interfacing with a firewall mechanism so that we can block IP traffic from an infected host when we know its identity. Reactive strategies involve measures taken in reaction to some observed event. For instance, when a host is under attack one might consider migrating objects and services from that host. From another perspective, a strategy can be defensive or tolerant. A defensive strategy attempts to prevent the symptom from resurfacing, whereas a tolerant strategy attempts to find ways to avoid the symptom. An example of defensive strategy is to block all IP traffic from a suspected host. Migrating replicas from the suspected host would be an example of a tolerant strategy. These different views of strategies are not mutually exclusive, and a comprehensive strategy needed by an application will probably involve all perspectives. Devising the right strategy for a particular application is non-trivial. In the initial stage of our application, we use an ad-hoc case analysis to formulate a strategy. Adaptation is a key capability for supporting any sur- 2.3. Bridging the Gap Between Application and Infrastructure Attempting to build an application with a survivability strategy within the current distributed systems infrastructure , we find that there is a disconnect between what the application needs and what the infrastructure provides. To be specific, the following are key issues that need to be addressed by developing new capabilities in the middleware-space between the application and the infrastructure: Awareness and control of infrastructure: Attacks affect availability and quality of system resources. Since a survivable application needs to cope with these effects, awareness of the resource situation is very important. In order to respond to attack symptoms effectively, the application may need to exert some control over parts of the infrastructure as well. In general, the core issue here is to make the quality and availability of resources translucent, instead of opaque as is done in older middleware. ☎ ☎ 3 Resource redundancy and its management: Preplanned and coordinated attacks may cause resources to fail in an arbitrary manner, perhaps simultaneously or in stages. As in fault-tolerance, resource redundancy is critical for tolerating such failures, and resource managers must be capable of addressing Byzantine failures. All resources that are critical for an application will have to be redundant and managed through an uniform interface so that the application can easily integrate with them to achieve the desired degree of awareness and control. Protection of resources and resource management systems is also very important. Continuing with the example adaptive response of migrating replicas, if the attacker can exploit the redundancy mechanism, he can force the replica to migrate to a host of his choice which may be easier to attack or has already been compromised. A new self-protected redundancy mechanism managing multiple resources and implementing Byzantine fault-tolerance algorithms is required. Byzantine fault-tolerant techniques are typically expensive, and the requirement for tolerating Byzantine failures in multiple system resources make it even more difficult. We are investigating engineering solutions that combine fault-tolerance and security techniques appropriately, and where graceful degradation is acceptable and the cost and quality of the solution is dynamically adjustable. ☎ we then started to look at survivability requirements of distributed applications. We presented our initial work towards this in an earlier paper[5] describing how integration with IDSs can lead to system agility and discussed various benefits of such integration. We have two ongoing projects in this area. One focuses on defensive strategies and how to build applications that incorporate defensive strategies. In the other, we are exploring intrusion tolerance by means of middleware coordinated adaptation that is unpredictable to the attacker. 4. Conclusion Our ongoing work in adaptive middleware has led to the discovery that applications that are aware of changes in their environment can participate in their own defense, recognizing the symptoms of attacks and improving the detection and responses of security and defense mechanisms operating on their behalf. Traditional faulttolerance techniques, if extended to support coordinated, malicious faults, appear promising to enhance the survival of applications. We are currently performing research to make fault-tolerance techniques suitable for the tolerance of intrusions in large-scale distributed systems. We are focusing on tolerating the symptoms of attacks; tolerating Byzantine failures that can be parts of malicious, coordinated attacks; and concentrating on unpredictable responses, both pro-active and reactive, that will prove difficult for intruders to exploit. In this paper have outlined our approach, and presented several key problems that we are currently investigating. Coordination and management of adaptation: Adaptive responses may involve changes in the application as well as changes in the way resources are managed. For a systematic implementation of a survivability strategy, this adaptive behavior needs to be separated from the application’s functional behavior. We argue that the space between the application and the infrastructure is the appropriate place for it. Adaptative responses will usually involve coordination among multiple mechanisms that are also in the middle; these mechanisms will have various responsibilities ranging from resource management to system security. It may be the case that one adaptive response will preclude another. Some of these adaptive responses may be more expensive than others. One of the underlying issues involved here is the trade-off between adaptation and the application’s performance. References [1] L. Badger. Generic software wrappers. Internet URL http://www.pgp.com/research/nailabs/secureexecution/wrappers-overview.asp, 2000. [2] BBN Distributed Systems Research Group, DIRM project team. DIRM technical overview. Internet URL http://www.dist-systems.bbn.com/projects/DIRM, 1998. [3] M. Cukier, J. Ren, C. Sabnis, D. Henke, J. Pistole, W. Sanders, D. Bakken, M. Berman, D. Karr, and R. E. Schantz. AQuA: An adaptive architecture that provides dependable distributed objects. In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, pages 245–253, October 1998. [4] A. Ghosh. Sandboxing mobile code. Internet URL http://www.rstcorp.com/research/sandboxing/, 2000. [5] J. P. Loyall, P. P. Pal, R. E. Schantz, and F. Webber. Building adaptive and agile applications using intrusion detection and response. In Proceedings of the ISOC Network and Distributed Systems Security Conference, Februrary 2000. [6] J. A. Zinky, D. E. Bakken, and R. E. Schantz. Architectural support for Quality of Service for CORBA objects. Theory and Practice of Object Systems, 1(3):55–73, April 1997. 3. Our Background and Recent Work In our earlier work with QoS in distributed systems, we have developed individual resource management systems (AQuA, DIRM) [3, 2] that offer a certain desired quality to the QoS-aware application via an adaptive middleware called QuO [6]. This middleware allows the application to adjust itself when resource constraints do not meet the desired level. These applications display a minimal degree of inherent survivability. With the capability of introducing environmentawareness and adaptive behavior in distributed systems, 4