AI modeling for source code understanding tasks has been making significant progress, and is being adopted in production development pipelines. However, reliability concerns, especially whether the models are actually learning task-related aspects of source code, are being raised. While recent model-probing approaches have observed a lack of signal awareness in many AI-for-code models, i.e., models not capturing task-relevant signals, they do not offer solutions to rectify this problem. In this paper, we explore data-driven approaches to enhance models’ signal awareness: 1) we combine the SE concept of code complexity with the AI technique of curriculum learning; 2) we incorporate SE assistance into AI models by customizing Delta Debugging to generate simplified signal-preserving programs and adding them to the training dataset. With our techniques, we achieve up to 4.8x improvement in model signal awareness. Using the notion of code complexity, we further present a novel model lear...
Modern cloud applications are distributed across a wide range of instances of multiple types, including virtual machines, containers, and bare-metal servers. Traditional approaches to monitoring and analytics fail in these complex, distributed and diverse environments. They are too intrusive and heavy-handed for short-lived, lightweight cloud instances, and cannot keep up with the rapid pace of change in the cloud with continuous dynamic scheduling, provisioning and auto-scaling. We introduce a unified monitoring and analytics architecture designed for the cloud. Our approach leverages virtualization and containerization to decouple monitoring from instance execution and health. Moreover, it provides a uniform view of systems regardless of instance type, and operates without interfering with the end-user context. We describe an implementation of our approach in an actual deployment, and discuss our experiences and observed results.
Proceedings of the 5th International Workshop on Container Technologies and Container Clouds, 2019
Linux containers are key enablers for building microservices. An application's microservices fall broadly under two categories: the core-microservices implementing the business logic and the utility-microservices implementing middleware functionalities. Such functionalities include vulnerability scanning, monitoring, telemetry, etc. Segregating the utility-microservices in separate containers from the core-microservice containers may prevent them from achieving their functionality. This is due to the strong isolation between containers. By diffusing the boundaries between containers we can fuse them together and enable close collaboration. However, this raises several security concerns, especially that the utility-microservices may include vulnerabilities that threaten the entire application. In this paper, we analyze the different techniques to enhance the security of container fusion and present an automated solution based on Kubernetes to configure utility-microservices cont...
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective. Specifically, we ask whether signatures of vulnerabilities in source code can be learned from its graph representation, in terms of relationships between nodes and edges. We create a pipeline we call AI4VA, which first encodes a sample source code into a Code Property Graph. The extracted graph is then vectorized in a manner that preserves its semantic information. A Gated Graph Neural Network is then trained using several such graphs to automatically extract templates differentiating the graph of a vulnerable sample from a healthy one. Our model outperforms static analyzers, classic machine learning, as well as CNN and RNN-based deep learning models on two of the three datasets we experiment with. We thus show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches. (Submitted Oct...
Application runtimes are undergoing a fundamental transformation in the cloud, from general-purpose operating systems (OSes) in virtual machines (VMs) to lightweight, minimal OSes in microcontainers. On the one hand, this transformation is helping reduce application footprint in the cloud to increase agility and density and to minimize attack surface. On the other hand, it makes it challenging to implement system and application management tasks. Inspired by the on-demand Function as a Service (FaaS) model in serverless computing, in RECap we are designing a cloud-native solution to deliver systems and application management tasks through specially-managed Capsule containers. Capsule containers are dynamically attached to the running containers for the duration of their implemented function and are safely removed from application context afterwards. More generally, the RECap framework allows us to design disaggregated on-demand managed service delivery for containers in the cloud. In this pap...
We introduce a new mechanism to securely extend systems data collection software with potentially untrusted third-party code. Unlike existing tools that run extension modules or plugins directly inside the monitored endpoint (the guest), we run plugins inside a specially crafted sandbox, so as to protect the guest as well as the software core. To get the right mix of accessibility and constraints required for systems data extraction, we create our sandbox by combining multiple features exported by an unmodified kernel. We have tested its applicability by successfully sandboxing plugins of open-source data collection software for containerized guest systems. We have also verified its security posture in terms of successful containment of several exploits, which would otherwise have directly impacted a guest if shipped inside third-party plugins.
Applications have commonly been oblivious to their cloud runtimes. This is primarily because they started their journey in IaaS clouds, running on a guestOS inside VMs. Then, to increase performance, many guestOSes were paravirtualized, making them virtualization aware, so that they can bypass some of the virtualization layers, as in virtio. This approach still kept applications unmodified. Recently, we have been witnessing a rapid adoption of containers due to their packaging benefits, high density, fast start-up and low overhead. Applications are increasingly being on-boarded to PaaS clouds in the form of application containers or appc, where they are run directly on a cloud substrate like Kubernetes or Docker Swarm. This shift in deployment practices presents an opportunity to make applications aware of their cloud. In this paper, we present the Paracloud framework for application containers and discuss the Paracloud interface (PaCI) for three cloud operations, namely migration, auto-scal...
Identifying vulnerable code is a precautionary measure to counter software security breaches. Tedious expert effort has been spent to build static analyzers, yet insecure patterns are hardly ever fully enumerated. This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program, in order to improve prediction performance. Compared with a generic GNN, our enhancements include a synthesis of multiple representations learned from the several parsed graphs of a program, and a new training loss metric that leverages the fine granularity of labeling. Our model outperforms multiple text, image and graph-based approaches, across two real-world datasets.
Proceedings of the ACM Symposium on Cloud Computing, 2021
Cloud maturity and popularity have resulted in open source software (OSS) proliferation. In turn, managing OSS code quality has become critical in ensuring sustainable Cloud growth. On this front, AI modeling has gained popularity in source code understanding tasks, promoted by the ready availability of large open codebases. However, we have been observing certain peculiarities with these black boxes, motivating a call for their reliability to be verified before offsetting traditional code analysis. In this work, we highlight and organize different reliability issues affecting AI-for-code into three stages of an AI pipeline: data collection, model training, and prediction analysis. We highlight the need for concerted efforts from the research community to ensure credibility, accountability, and traceability for AI-for-code. For each stage, we discuss unique opportunities afforded by the source code and software engineering setting to improve AI reliability.
Papers by Sahil Suneja