
SLA Monitoring and Management Framework for Telecommunication Services

Fourth International Conference on Networking and Services (icns 2008), 2008
Jacek Kosiński, Piotr Nawrocki, Dominik Radziszowski, Krzysztof Zieliński, Sławomir Zieliński
AGH University of Science and Technology
{jgk, piter, radzisz, kz, slawek}@ics.agh.edu.pl

Grzegorz Przybylski, Paweł Wnęk
Comarch S.A.
pawel.wnek@comarch.com, grzegorz.przybylski@comarch.pl

Abstract

This paper presents an SLA monitoring and management framework for telecommunication services. The basic requirements of this class of systems are specified and verified in the context of existing SLA standards and tools. The proposed system architecture is very general and may interoperate with existing performance monitoring systems and management tools. The key design effort is focused on the general mapping between the observed activities of the underlying telecommunication infrastructure elements and the construction of the corresponding SLA instance object state. The application of the proposed framework is illustrated by a simple case study.

1. Introduction

SLAM (Service Level Agreement Management) [1] has recently been gaining increasing attention from telecom service providers and customers. Given the progress in QoS service provisioning and in dynamic, interactive control and monitoring of telecommunication resources, SLA (Service Level Agreement) management has become the most sought-after functionality. From the telecom system architecture point of view, SLAM is another layer that operates over the monitoring and resource control layers and provides a more abstract and consistent view of the services offered to end-users. It is the next logical step in the development of these systems, connecting the technical perspective of system infrastructure operation with the market and user-driven business objectives of the telecom company [2,3].
In general, an SLA specifies expectations about the quality of the provided service as seen from the end-user viewpoint. These expectations are expressed as rather aggregated metrics defined for a selected period of service provisioning. The difficulties in SLAM construction and implementation stem from the complexity of representing SLA contract metrics and from the on-line, bidirectional transformation between the business perspective and the operational parameters of the technical infrastructure (Fig. 1).

[Figure 1. Information flow in SLAM framework. Diagram labels: Business Policies, Negotiations, SLA Contract, Metrics formalization, SLA Monitoring and Management Framework, Bidirectional transformation of QoS parameters into SLA metrics, Network infrastructure, Monitoring.]

This paper describes an SLA monitoring and management framework designed and implemented under the "SLA Management Framework for Telco Service Providers" Eureka Project. Successful construction of the system was possible thanks to preexisting expertise and technology in the area of QoS monitoring and control provided by ComArch [8]. The most innovative part of this research concerns the components that transform monitoring layer data and propagate them to the SLA management subsystem.

The structure of the paper is as follows. Section 2 gives a short overview of existing SLA management tools. Section 3 very briefly specifies SLA metrics. The proposed framework architecture is presented in Section 5. Finally, Section 6 contains a SLAM application case study. The paper ends with conclusions.

0-7695-3094-X/08 $25.00 © 2008 IEEE. DOI 10.1109/ICNS.2008.31

2. Existing SLA management standards and tools

SLA [1] management tools let enterprise IT managers look at network performance in much the same way as the service provider does. SLA management tools such as Cisco TSM, ViewGate Networks' Inteligo and Oblicore Guarantee are typically deployed by service providers. Often these tools are also used by the enterprise (via a Web interface) and allow IT personnel to monitor network performance and manage their network SLAs in real time. An important aspect of management tools is an SLA specification language such as WSLA. SLA tools should also monitor and manage SLA metrics in real time, rather than providing mere historical views of past performance. The historical information should be saved in a database. This section contains a very brief overview of the leading SLA management tools from the functional point of view.

Cisco Total Service Management (TSM) [4] delivers on the demands of end-to-end service-level management with precise demarcation, per-tunnel measurement, and both business-level and detailed technical management reporting. Cisco TSM focuses mainly on the networking layers and makes net-centric SLAM metric information, such as round-trip latency, jitter, and response time, available to vendor applications dealing with upper and lower layers. Thus, TSM makes it possible to create an end-to-end solution by integrating cooperating applications from best-of-breed vendors.

Inteligo Advisor [5] is a service provider tool for monitoring service and verifying service-level agreements (SLAs).
Utilizing artificial-intelligence techniques, Inteligo Advisor predicts usage needs for virtual private networks (VPNs) and recommends the service adjustments needed to ensure that the service definitions are appropriate for the application. The ViewGate Networks tool also provides real-time SLA compliance verification and detection of network-usage anomalies. This combination of functionality enables service providers to meet SLAs, improve their response to changes in customer needs, and build greater customer loyalty.

Oblicore Guarantee [6] is a complete software package that allows proactive, continuous, top-down management of Service Level Agreements (SLAs) and other important business obligations. This tool is a field-proven software suite that allows the client organization to meet contractual commitments, service level agreements, and operational goals more often and more [...] the SLA business logic and compares performance objectives to real-time data aggregated across the enterprise.

The Web Service Level Agreement (WSLA) [7] language (based on XML) defines assertions of a service provider to perform a service according to agreed guarantees for IT-level and business-process-level service parameters, such as response time and throughput, and the measures to be taken in case of deviation and failure to meet the asserted service guarantees. The assertions of the service provider are based on a detailed definition of the service parameters, including how basic metrics are to be measured in systems and how they are aggregated into composite metrics.

3. SLA metrics definitions

The purpose of service level agreements is to formally define the parameters of the service that the provider guarantees to deliver. After an SLA has been agreed upon, the specified parameters are monitored in order to detect agreement breaches. One of the main issues of SLA creation is to define methods of measuring certain service parameters; these methods must be agreed upon by both sides of the contract.
When an SLA breach is detected, an appropriate remedy procedure (also defined in the contract) is applied. Moreover, the client is eligible for some compensation from the provider. Identifying the violations and calculating the penalties might prove quite challenging.

In general, an SLA contract [3] can be presented as a logical product of predicates built upon the results of measuring certain service parameters. To make it more understandable, the SLA can be expressed as a hierarchy of sub-contracts that are eventually expressed as rules defined on the measurement-based predicates.

The SLA Management System (SLAM) relies on a monitoring database, which feeds it with the results of parameter measurements.
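The view of an SLA as a logical product of measurement-based predicates, organised into a hierarchy of sub-contracts, can be illustrated with a minimal sketch. This is not the framework's implementation: the class names `Predicate` and `Contract`, the measurement keys, and all thresholds are assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Measurements = Dict[str, float]


@dataclass
class Predicate:
    """A single measurement-based rule, e.g. 'round-trip time <= 50 ms'."""
    name: str
    test: Callable[[Measurements], bool]

    def holds(self, m: Measurements) -> bool:
        return self.test(m)

    def breaches(self, m: Measurements) -> List[str]:
        return [] if self.holds(m) else [self.name]


@dataclass
class Contract:
    """An SLA or sub-contract: the logical product (AND) of its parts."""
    name: str
    parts: List[Union["Contract", Predicate]] = field(default_factory=list)

    def holds(self, m: Measurements) -> bool:
        return all(part.holds(m) for part in self.parts)

    def breaches(self, m: Measurements) -> List[str]:
        # Collect the path to every failing predicate in the hierarchy.
        return [f"{self.name}/{b}" for part in self.parts for b in part.breaches(m)]


# Hypothetical contract and thresholds, for illustration only.
sla = Contract("vpn-sla", [
    Contract("availability", [
        Predicate("uptime>=99.5", lambda m: m["availability_pct"] >= 99.5),
    ]),
    Contract("performance", [
        Predicate("rtt<=50ms", lambda m: m["rtt_ms"] <= 50.0),
        Predicate("loss<=1%", lambda m: m["loss_pct"] <= 1.0),
    ]),
])

measured = {"availability_pct": 99.9, "rtt_ms": 72.0, "loss_pct": 0.2}
print(sla.holds(measured))     # False
print(sla.breaches(measured))  # ['vpn-sla/performance/rtt<=50ms']
```

Keeping the path to each failing predicate is one way to report which sub-contract was violated, which is the kind of information a remedy or penalty procedure would need.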
The tasks of the SLAM system are to:
• interpret the data provided by a monitoring system in the context of the agreements signed by a particular provider. This functionality also includes notifying the provider about unexpected events that could lead to an SLA breach before it occurs.
• provide regular reports for the customer regarding the SLA parameters [1].
• provide data for external systems for calculating compensations for SLA breaches.

The compensations are typically expressed in terms of rebates or payment-free periods of service provision that the client gets in case of a contract breach. Although the calculation details are covered by the contract, and the SLAM system itself is not intended to calculate the penalties, a few factors regarding the calculations were taken into consideration in the design phase in order to identify the data that should be delivered to the calculation subsystem. On the one hand, to allow maximum flexibility of penalty calculations, fine-grained output seems to be required. On the other hand, the SLAM system is not expected to duplicate the functionality of the monitoring system. The selected approach was to define an interface between the SLAM system and the monitoring system(s) that would be used, if needed, to deliver the fine-grained data.

[Figure 2. SLAM system architecture. Diagram labels: GUI (User console); Object Repository (SLA Inventory, Alarm Repository, Service Inventory, PM Inventory (KPI def.)); Inventory database; Trouble Ticket; Network Inventory; Services configuration; Data collections and calculations definitions; SLM System (SLA Monitoring Engine, Service Monitoring Engine, Reporting Engine, Task Scheduler); State changes (API); PM System (PMEngine); Triggers processes; KPI (raw data); KPI, counters; Billing System; SQL System; CDR (SLA violations); PM database (KPI, counters storage); SLM Data Storage (service state history, KPI, SLA parameters).]

5. SLA framework architecture

Figure 2 describes the overall architecture schema of the SLAM framework.
The modular approach makes it possible, through the use of mediation agents, to collect data from different data sources such as a Trouble Ticketing system, a Performance Management database, or the network elements directly. This approach allows the SLAM framework to be used either as an umbrella system for other vendor-specific or client-specific systems, or as a standalone solution.

Effective implementation of the SLA management framework requires specific infrastructure services to be accessible. These include data sources together with specific adapters to collect and process the necessary data, a graphical user interface, and a communication bus. The framework also has to be deployed over a real network architecture. This process needs access to a specific inventory database to retrieve information about the physical description, parameters and location of each item (network device, interface, host) and about how the items are interconnected.

Because the SLAM system has to process different types of information, the modular approach has been proposed as the most suitable for such an implementation. Two types of information have been identified as the most important. The first type is Fault Management information, either retrieved from existing Fault Management systems or gathered directly from network elements. This information should be processed in the Service Monitoring module of the SLAM system. The second type of information consists of performance-related KPIs (Key Performance Indicators), which are heavily time dependent. This information should be processed by the Performance Management System, in which the Performance Management engine is the most important component. The details of an example implementation are provided later in this paper, in the Case Study section.

[Figure 3. Logical architecture of the SLAM framework. Diagram labels: SLM, SLA Template, Product, SLA, Customer, Service Template, Service Inventory, Service Access Point Template, Service Access Point, Network Element.]
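The modular collection scheme described above, in which mediation agents feed fault information to the Service Monitoring module and KPIs to the Performance Management system, can be sketched roughly as follows. The paper does not specify the agent interface, so every name here (`MediationAgent`, `StaticAgent`, `Framework`, the record fields) is a hypothetical stand-in rather than the framework's API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Union


@dataclass(frozen=True)
class FaultEvent:
    element: str    # network element that raised the event
    severity: str   # e.g. "CRITICAL", "WARNING", "CLEAR"


@dataclass(frozen=True)
class KpiSample:
    element: str
    kpi: str
    value: float
    timestamp: int  # seconds since the epoch


Record = Union[FaultEvent, KpiSample]


class MediationAgent(Protocol):
    """Assumed adapter interface: one agent per data source."""
    def collect(self) -> Iterable[Record]: ...


class StaticAgent:
    """Stand-in for a real adapter; simply returns canned records."""
    def __init__(self, records: Iterable[Record]) -> None:
        self._records = list(records)

    def collect(self) -> Iterable[Record]:
        return iter(self._records)


class Framework:
    """Routes each record to the module responsible for its type."""
    def __init__(self, agents: Iterable[MediationAgent]) -> None:
        self.agents = list(agents)
        self.service_monitoring: List[FaultEvent] = []
        self.performance_management: List[KpiSample] = []

    def poll(self) -> None:
        for agent in self.agents:
            for record in agent.collect():
                if isinstance(record, FaultEvent):
                    self.service_monitoring.append(record)
                else:
                    self.performance_management.append(record)


fw = Framework([
    StaticAgent([FaultEvent("router-1", "CRITICAL")]),
    StaticAgent([KpiSample("router-1", "rtt_ms", 12.5, 0)]),
])
fw.poll()
print(len(fw.service_monitoring), len(fw.performance_management))  # 1 1
```

Because the framework depends only on the `collect()` contract, a new data source could be added by writing another adapter, which is the property that lets the system act either as an umbrella over existing systems or as a standalone solution.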
The logical architecture of the system presented in Fig. 3 is composed of different levels containing:
• Service Access Points (SAP): related to network elements (physical devices or components of these devices),
• Services: composed of different service nodes and/or other services,
• Products: understood as groups of services,
• SLAs: linked directly to products.

Each SAP, Service, SLA or Product is defined as a template object first and then instantiated as a working instance (see Fig. 4). A SAP represents the relation between a service and the network infrastructure. The SAP Template is a template object for SAPs; in addition to its attributes, the SAP Template object contains a set of rules for easily assigning the appropriate Network Element objects to the SAP objects related to a given SAP Template. A Service Object is a single element of the service tree, which is the representation of the real-life service structure, and it has the following attributes:
• current state of the element,
• event propagation formula for the given element,
• KPI propagation formula for the given element.

[Figure 4. Instances and Templates in the SLAM framework. Diagram labels: Template, Instances, SLAs, Customers, Services, Products, Network infrastructure, Available events, Available KPIs, Available network infrastructure types.]

Each service object may have one of two processing modes:
• Automatic: the state of the Service object is calculated automatically by the engine from the event propagation formula,
• Manual: the state of the Service is set by the operator.

Similarly to the SAP Template, the Service Template object is used only to define the service hierarchy and processing rules. This object is not processed by the SLAM Engine during the Service and SLA monitoring process. The Product object is used to group service tree-like structures. The template of an SLA contract can only be created in the context of a Product. Finally, the SLA Object representing an SLA contract consists of a set of common SLA parameters (Availability, Max Time to Restore, Time to Violate) and SLA KPIs, and it is instantiated according to an SLA Template. Only one Product Object can be attached to an SLA Template.

The "near real time" computation approach has been identified as the most suitable for updating the parameters of the SLA objects. The required granularity of time is a parameter of the system. Performance issues and data delivery schedules determine the frequency of state calculations for each of the objects.

[Figure 5. "Near real time" and "Long-term" information processing. Diagram labels: SLAM Engine, "Near real-time" KPI processing, "Long-term" KPI processing, "Near real-time" Event processing, "Long-term" Event processing, "Real-time" data, "Delayed" data, Approximate monitoring, More precise monitoring results.]

The proposed concept is presented in Fig. 5. The "near real time" mode processes data incoming from the network infrastructure (events, KPIs) in almost real time. All parameters of services and SLAs are calculated continuously with a configurable period (depending on the number of monitored services and SLAs and on the hardware specification, the "near real time" calculation period can vary between 1 minute and 60 minutes). Because not all network devices provide the required data at short intervals, the results of "near real time" processing can be inaccurate (for example, KPIs can be retrieved from some devices once an hour or even less often).

In order to process historical or delayed data coming from the network infrastructure (events, KPIs), the "long-term" processing mode has been implemented. It recalculates all parameters of services and SLAs, ensuring that all the information that was not available during "near real time" processing is processed in the "long-term" cycle. "Long-term" processing helps to improve the results of the evaluation of service and SLA parameters. It is possible to define more than one "long-term" process, running with different delays and on different schedules, but at least one "long-term" process must be running to generate reliable data for SLAs after the accounting period.

Events from the network (see Fig. 6) are collected by the Fault Management system and processed through filters classifying them into three categories (CRITICAL, WARNING, CLEAR). They are used to define the state of the SAP, which is then propagated up the service tree structure. For each service object, the service propagation rules decide which events should be considered to put the service in a specific state. For an SLA object, the parameters are calculated and, after the values of the SLA KPIs have been checked against thresholds, SLA violation alarms are generated if necessary. The SLA KPIs are specific to the SLA and differ somewhat from the performance indicators for services and SAPs.

[Figure 6. Events and KPIs processing. Diagram labels: KPIs from the network, Events from the network, SAP objects, Service objects, SLA objects, Values of SLA parameters, SLA violation alarms.]

The KPIs for Service Access Points are aggregated on each level of the structure. It is possible to define separate aggregation formulas for each type of SAP or Service. After aggregation, the KPIs are used to define the state of the Service and are then propagated upwards to define the state of the SLA.

Three main modes have been implemented to manage the SLA contracts:
• "Real time" reporting with the Alarm List,
• "On demand" reporting with the Reporting Module,
• Automatic, scheduled reporting with a separate Reporting Module.

The Alarm List makes it possible (with the use of specific filters) to show only the information that is currently the most relevant and important for the operator. The operator can monitor the states and KPIs of the SLAs and, in case of an emergency, can "drill down" through the whole tree-like structure of the services and SAPs to the exact network element that causes problems. This shortens reaction times and helps to cut down on penalty costs resulting from SLA violations.

The "on demand" reporting helps the operator to see all the aggregated KPIs of the selected service and SLA. It might also be useful before actually signing and implementing the SLA, in order to prepare a feasibility study of whether the expected parameters can be met.

Automatic reporting can be scheduled after the SLA has been signed in order to prepare detailed billing information and an SLA report for the customer. Specific rules can be defined in order to calculate the penalties resulting from SLA violations. The SLA management framework can also be integrated with a CRM system, and access rights to some information can be given to sales agents or account managers to help them manage the specific SLAs of their clients and retrieve information about their states.

6. SLA application case study

The prototype of the SLAM framework has been installed as a pilot implementation for a large national fixed telephony and internet service provider. The solution is now used to monitor the state of four VPNs (Virtual Private Networks) and to calculate the SLA parameters of the SLA contracts. Each node of a particular VPN is connected to a different interface of the core network routers and is monitored independently. The monitoring of a VPN node is performed by the Service Assurance Agent, which is an embedded part of the OS of the Cisco routers. The agent periodically sends a configured number of data packets to the Customer Premises Equipment (CPE), where an Agent Responder is installed. The agents therefore provide the following metrics of the link:
• availability (0: link available, 1: link unavailable),
• average round trip time [ms],
• round trip time jitter [ms],
• packet loss [%],
• incoming traffic [kb/s],
• outgoing traffic [kb/s].

The measured values are accessible by means of the SNMP protocol; they are gathered from the routers every five minutes and stored in flat files by an external application. Periodically (every hour) the files with performance data are uploaded into two databases: a production one and a development one. The requirements were defined such that only the performance indicators were monitored, and no fault management data were taken into consideration. The performance data are stored every hour in a dedicated database (Sybase ASE-15). Several thresholds for availability and performance parameters have been defined, and penalty calculation formulas have been implemented according to the requirements of the ISP.

The mediation used to fetch the data from the dedicated mediation database running Sybase is composed of two server processes:
• a process querying the relational database,
• an "adapter", which translates standard SLAM system data request API calls into requests to the process that queries the SQL database.

Once the data are fetched, the process creates the Performance Management (PM) Module data response and sends it back to the PM Module, where the data are processed (KPIs are calculated and stored in an internal database, where they can be used by the SLAM module).

The inventory data model has been configured to store the specific objects required to create performance data collection processes. Two new classes have been created in order to map the inventory model onto its representation in the Mediation Database. The first class (SAACollector) represents the connection between the interface on the ISP router and the customer site. For each VPN node, one instance of this object has been created. The second class (VPNSite) represents a customer site as a part of a particular VPN.
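To make the case-study data model more concrete, the following minimal sketch models the two inventory classes and derives link availability from the five-minute samples. Only the class names SAACollector and VPNSite and the meaning of the availability flag come from the text; the attributes, the `LinkSample` type, the method name, and the 99% threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class VPNSite:
    """A customer site that is part of a particular VPN (fields assumed)."""
    vpn: str
    site: str


@dataclass(frozen=True)
class LinkSample:
    """One five-minute measurement of a VPN node link."""
    unavailable: int   # 0 = link available, 1 = link unavailable
    rtt_ms: float      # average round trip time
    jitter_ms: float   # round trip time jitter
    loss_pct: float    # packet loss


@dataclass
class SAACollector:
    """Connection between an ISP router interface and a customer site."""
    router_interface: str
    site: VPNSite
    samples: List[LinkSample] = field(default_factory=list)

    def availability_pct(self) -> float:
        """Share of samples in which the link was reported available."""
        if not self.samples:
            raise ValueError("no samples collected yet")
        up = sum(1 for s in self.samples if s.unavailable == 0)
        return 100.0 * up / len(self.samples)


# One hour of hypothetical samples: 9 available, 3 unavailable.
collector = SAACollector("core-1:Gi0/1", VPNSite("vpn-a", "site-1"))
collector.samples += [LinkSample(0, 20.0, 1.5, 0.0)] * 9
collector.samples += [LinkSample(1, 0.0, 0.0, 100.0)] * 3

MIN_AVAILABILITY_PCT = 99.0  # assumed threshold, not taken from the paper
availability = collector.availability_pct()
print(availability)                         # 75.0
print(availability < MIN_AVAILABILITY_PCT)  # True -> threshold violated
```

Counting samples is only an approximation of availability over an accounting period; a "long-term" recalculation of the kind described earlier would rerun the same computation once delayed samples have arrived.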
Because the SLAM framework philosophy requires templates to be created first and then instantiated, six templates have been defined: one SAP template (which defines the template for the VPN node link and specifies which KPIs are propagated to the service level), one service template (which uses the previously defined SAP template and specifies which KPIs are available at the SLA level), one product template (containing the previously defined service template), and three SLA templates. The specific instances of the objects and SLA contracts have then been created from the previously defined templates.

The correlation system has been used in this implementation to perform a custom action when PM Threshold alarms are generated. This kind of alarm notifies the user that bandwidth usage is above a given threshold and that penalties should not be calculated for the KPI(s).

The reporting functionality allows the users to:
• display the current KPI values graphically (as a chart),
• monitor the status of the services and SLA contracts in so-called "near real time" by means of the Alarm List and dedicated views available in the SLAM module,
• calculate the penalties for SLA violations by means of predefined report templates.

The SLAM system has thus been implemented successfully in a real environment. All the predefined requirements of the client have been met, and the system is presently being evaluated from the business perspective.

7. Conclusions

The proposed framework for SLA monitoring and management represents a rather general solution that may be customized according to the offered service provisioning, the SLA definition requirements and the underlying monitoring system infrastructure.

The constructed mechanisms of data flow and transformation between the monitoring layer and the SLA layer are rather general. They benefit from a clear object-oriented approach and from the representation of services as a tree structure. A very sensitive point of the proposed solution is the SLA evaluation procedure. Splitting this procedure into "near real time" and "long-term" phases seems to be the only solution that could satisfy rather general needs.

The proposed SLAM system follows an open-loop control model, without feedback between the observed SLA values and the activities that should be taken at the network infrastructure layer to achieve the SLA objectives. Closing this loop is another challenging task that should be resolved.

Acknowledgement

This research was partially supported by the Polish Ministry of Education and Science; grant no. E! 3152 SLAM.

8. References

[1] Jim Carr, Service Level Agreements, CMP, Inc., 2001.
[2] Service-Level Management: Defining and Monitoring Service Levels in the Enterprise, White Paper, Cisco Systems, Inc., 2001.
[3] John Lee, Ron Ben-Natan, Integrating Service Level Agreements: Optimizing Your OSS for SLA Delivery, Wiley, 2002.
[4] Rick Sturm, Wayne Morris, Foundations of Service Level Management, Sams, 2000.
[5] Viewgate's Advisor Product Enables Service Providers to Deliver on VPN SLAs and Build Customer Loyalty, Viewgate Networks, ISP Business, July 2001.
[6] Oblicore Guarantee, Oblicore, www.oblicore.com.
[7] WebService Level Agreements (WSLA), IBM, www.research.ibm.com/wsla/.
[8] ComArch OSS Suite, ComArch, www.comarch.com.