Indic heritage knowledge is embedded in millions of manuscripts at various stages of digitization and analysis. Numerous powerful tools and techniques have been developed for linguistic analysis of Samskrit and Indic language texts. However, the key challenge today is employing them together on large document collections and building higher-level end-user applications to make Indic knowledge texts intelligible. We believe the chief hurdle is the lack of an end-to-end, secure, decentralized system platform for (i) composing independently developed tools for higher-level tasks, and (ii) employing human experts in the loop to work around the limitations of automated tools so that content is always curated. Such a platform must define protocols and standards for interoperability and reusability of tools while enabling their autonomous evolution to spur innovation. This paper describes the architecture of an Internet platform for end-to-end Indic knowledge processing called Vedavaapi that...
Khazana is a peer-to-peer data service that supports efficient sharing and aggressive caching of mutable data across the wide area while giving clients significant control over replica divergence. Previous work on wide-area replicated services focused on at most two of the following three properties: aggressive replication, customizable consistency, and generality. In contrast, Khazana provides scalable support for large numbers of replicas while giving applications considerable flexibility in trading off consistency for availability and performance. Its flexibility enables applications to effectively exploit inherent data locality while meeting consistency needs. Khazana exports a file system-like interface with a small set of consistency controls which can be combined to yield a broad spectrum of consistency flavors ranging from strong consistency to best-effort eventual consistency. Khazana servers form failure-resilient dynamic replica hierarchies to manage replicas across vari...
Performance problems in complex systems are often caused by under-provisioning, workload interference, incorrect expectations or bugs. Troubleshooting such systems is a difficult task faced by service engineers. We have built CLUEBOX, a non-intrusive toolkit that aids rapid problem diagnosis. It employs machine learning techniques on the available performance logs to characterize workloads, predict performance and discover anomalous behavior. By identifying the most relevant anomalies to focus on, CLUEBOX automates the most onerous aspects of performance troubleshooting. We have experimentally validated our methodology in a networked storage environment with real workloads. Using CLUEBOX to learn from a set of historical performance observations, we were able to distill over 2000 performance counters into 68 counters that succinctly describe a running workload. Further, we demonstrate effective troubleshooting of two scenarios that adversely impacted application response time: (1) a...
Essentially all distributed systems, applications, and services at some level boil down to the problem of managing distributed shared state. Unfortunately, while this problem is common to many applications, there is no common means of managing the data; every application devises its own solution. We have developed Khazana, a distributed service exporting the abstraction of a distributed, persistent, globally shared store that applications can use to store their shared state. Khazana is responsible for performing many of the common operations needed by distributed applications, including replication, consistency management, fault recovery, access control, and location management. Using Khazana as a form of middleware, distributed applications can be quickly developed from corresponding uniprocessor applications through the insertion of Khazana data access and synchronization operations.
Coherent wide-area data caching can improve the scalability and responsiveness of distributed services such as wide-area file access, database and directory services, and content distribution. However, distributed services differ widely in the frequency of read/write sharing, the amount of contention between clients for the same data, and their ability to make tradeoffs between consistency and availability. Aggressive replication enhances the scalability and availability of services with read-mostly data or data that need not be kept strongly consistent. However, for applications that require strong consistency of write-shared data, replication must be throttled to achieve reasonable performance. We have developed a middleware data store called Swarm, designed to support the wide-area data sharing needs of distributed services. To support the needs of diverse distributed services, Swarm provides: (i) a failure-resilient proximity-aware data replication mechanism that adjusts the replic...
One of the most important services required by most distributed applications is some form of shared data management, e.g., a directory service manages shared directory entries while groupware manages shared documents. Each such application currently must implement its own data management mechanisms because existing runtime systems are not flexible enough to support all distributed applications efficiently. For example, groupware can be efficiently supported by a distributed object system, while a distributed database would prefer a more low-level storage abstraction. The goal of Khazana is to provide programmers with configurable components that support the data management services required by a wide variety of distributed applications, including consistent caching, automated replication and migration of data, persistence, access control, and fault tolerance. It does so via a carefully designed set of interfaces that support a hierarchy of data abstractions ranging from flat data to C++/Java objec...
The lack of a flexible consistency management solution hinders P2P implementation of applications involving updates, such as directory services, online auctions and collaboration. Managing shared data in a P2P setting requires a consistency solution that can operate in a heterogeneous network, support pervasive replication for scaling, and give peers autonomy to tune consistency to their sharing needs and resource constraints. Existing solutions lack one or more of these features. In this paper, we propose a new way to structure consistency management for P2P sharing of mutable data called composable consistency. It lets applications compose a rich variety of consistency solutions appropriate for their sharing needs, out of a small set of primitive options. Our approach splits consistency management into design choices along five orthogonal aspects, namely, concurrency, consistency, availability, update visibility and isolation. Various combinations of these choices can be employed t...
Indic heritage knowledge is embedded in millions of manuscripts at various stages of digitization and analysis. Though numerous powerful tools have been developed for linguistic analysis of Samskrit texts, employing them together on large document collections and building end-user applications is a challenge due to non-standard interfaces. This paper examines the architectural needs of scalable Indic document analytics, and presents our experience in building an actual system. Though it is a work in progress, we demonstrate how careful metadata design enabled us to rapidly develop useful applications via extensive reuse of state-of-the-art analysis tools. This paper offers an approach to standardization of linguistic analysis output, and lays out guidelines for Indic document metadata design and storage.
In this paper, we describe DataStations, an architecture that provides ubiquitous transient storage to arbitrary mobile applications. Mobile users can utilize a nearby DataStation as a proxy cache for their remote home file servers, as a file server to meet transient storage needs, and as a platform to share data and collaborate with other users over the wide area. A user can roam among DataStations, creating, updating and sharing files via a native file interface using a uniform file name space throughout. Our architecture provides transparent migration of file ownership and responsibility among DataStations and a user’s home file server. This design not only ensures file permanence, but also allows DataStations to reclaim their resources autonomously, allowing the system to incrementally scale to a large number of DataStations and users. The unique aspects of our DataStation design are its decentralized but uniform name space, its locality-aware peer replication mechanism, and it...
Storage infrastructure in large-scale cloud data center environments must support applications with diverse, time-varying data access patterns while meeting quality-of-service requirements. Deeper storage hierarchies induced by solid state and rotating media are enabling new storage management tradeoffs that do not apply uniformly to all application phases at all times. To meet service level requirements in such heterogeneous application phases, storage management needs to be phase-aware and adaptive, i.e., to identify specific storage access patterns of applications as they occur and customize their handling accordingly. This paper presents LoadIQ, a novel, versatile, adaptive application phase detector for networked (file and block) storage systems. In a live deployment, LoadIQ analyzes traces and emits phase labels learnt on the fly by using Support Vector Machines (SVM), a state-of-the-art classifier. Such labels could be used to generate alerts or to trigger phase-specific system tuni...
International Symposium on Sanskrit Computational Linguistics, 2019
Indic heritage knowledge is embedded in millions of manuscripts at various stages of digitization and analysis. Numerous powerful tools and techniques have been developed for linguistic analysis of Samskrit and Indic language texts. However, the key challenge today is employing them together on large document collections and building higher-level end-user applications to make Indic knowledge texts intelligible. We believe the chief hurdle is the lack of an end-to-end, secure, decentralized system platform for (i) composing independently developed tools for higher-level tasks, and (ii) employing human experts in the loop to work around the limitations of automated tools so that content is always curated. Such a platform must define protocols and standards for interoperability and reusability of tools while enabling their autonomous evolution to spur innovation. This paper describes the architecture of an Internet platform for end-to-end Indic knowledge processing called Vedavaapi that addresses these challenges effectively. At its core, Vedavaapi is a community-sourced, scalable, multi-layered annotated object network. It serves as an overlay on Indic documents stored anywhere online by providing textification, language analysis and discourse analysis as value-added services in a crowd-sourced manner. It offers federated deployment of tools as microservices, powerful decentralized user and team management with access control across multiple organizational boundaries, social-media login, and an open architecture with extensible and evolving object schemas. As its first application, we have developed human-assisted text conversion of handwritten manuscripts, such as palm-leaf manuscripts, leveraging several standards-based open-source tools including ones by IIIT Hyderabad, IIT Kanpur and University of Hyderabad.
We demonstrate how our design choices enabled us to rapidly develop useful applications via extensive reuse of state-of-the-art analysis tools. This paper offers an approach to standardization of linguistic analysis output, and lays out guidelines for Indic document metadata design and storage.
Papers by Sai Susarla