Sardar is a Computational Linguist. He holds a PhD in Natural Language Processing and currently researches topics related to Computer Vision, Robotics and Natural Language Processing.
Dropout is a crucial regularization technique for Recurrent Neural Network (RNN) models of Natural Language Inference (NLI). However, the effectiveness of dropout at different layers and dropout rates has not been evaluated in NLI models. In this paper, we propose a novel RNN model for NLI and empirically evaluate the effect of applying dropout at different layers in the model. We also investigate the impact of varying dropout rates at these layers. Our empirical evaluation on a large (Stanford Natural Language Inference (SNLI)) and a small (SciTail) dataset suggests that dropout at each feed-forward connection severely degrades model accuracy as the dropout rate increases. We also show that regularizing the embedding layer is effective for SNLI, whereas regularizing the recurrent layer improves accuracy for SciTail. Our model achieves an accuracy of 86.14% on the SNLI dataset and 77.05% on SciTail.
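The dropout placements compared in the abstract above can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration of inverted dropout applied at three points (embedding output, recurrent output, feed-forward input), not the paper's actual model; the function name and placements are assumptions for exposition.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and scale
    survivors by 1/(1 - rate), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
embeddings = np.ones((5, 8))  # 5 tokens, 8-dim embeddings (toy values)

# Three candidate placements, sketched on toy activations:
h = inverted_dropout(embeddings, 0.2, rng)  # after the embedding layer
h = inverted_dropout(h, 0.2, rng)           # after the recurrent layer
h = inverted_dropout(h, 0.5, rng)           # at a feed-forward connection
```

Because survivors are rescaled during training, the same forward pass can be reused at test time simply by setting the rate to zero.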
International Journal of Asian Language Processing, 2018
There are a number of machine learning algorithms for recognizing Arabic characters. In this paper, we investigate a range of strategies for combining multiple machine learning algorithms for the task of Arabic character recognition, where we are faced with imperfect and dimensionally variable input characters. We present two different strategies for combining multiple machine learning algorithms: a manual backoff strategy and an ensemble learning strategy. We report the performance of individual algorithms and of combined algorithms on recognizing Arabic characters. Experimental results show that combined confidence-based strategies can produce more accurate results than each algorithm produces by itself, and even than those exhibited by the majority voting combination.
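The two combination strategies named above can be sketched as follows. This is a hedged toy sketch, not the paper's implementation: the function names, the 0.8 threshold, and the fallback-to-last-classifier behaviour are all illustrative assumptions.

```python
from collections import Counter

def manual_backoff(predictions, threshold=0.8):
    """predictions: (label, confidence) pairs from classifiers in priority
    order. Accept the first prediction whose confidence clears the
    threshold; otherwise back off to the last classifier's answer."""
    for label, confidence in predictions:
        if confidence >= threshold:
            return label
    return predictions[-1][0]

def majority_vote(predictions):
    """Baseline combiner: return the label predicted by most classifiers."""
    labels = [label for label, _ in predictions]
    return Counter(labels).most_common(1)[0][0]

preds = [("alif", 0.55), ("ba", 0.91), ("alif", 0.60)]
print(manual_backoff(preds))  # "ba": first prediction above the threshold
print(majority_vote(preds))   # "alif": two of three classifiers agree
```

The confidence-based strategy can overrule the majority when a minority classifier is much more certain, which is the behaviour the experimental results above reward.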
Over the past decade, the digitization of services has transformed the healthcare sector, leading to a sharp rise in cybersecurity threats. Poor cybersecurity in the healthcare sector, coupled with the high value of patient records, has attracted the attention of hackers. Sophisticated advanced persistent threats and malware have significantly contributed to increasing risks to the health sector. Many recent attacks are attributed to the spread of malicious software, e.g., ransomware or bot malware. Machines infected with bot malware can be used as tools for remote attack or even cryptomining. This paper presents a novel approach, called BotDet, for botnet Command and Control (C&C) traffic detection to defend against malware attacks in critical infrastructure systems. There are two stages in the development of the proposed system: 1) we have developed four detection modules to detect different possible techniques used in botnet C&C communications, and 2) we have designed a correlation framework to reduce the rate of false alarms raised by individual detection modules. Evaluation results show that BotDet balances the true positive rate and the false positive rate at 82.3% and 13.6%, respectively. Furthermore, they demonstrate BotDet's capability of real-time detection. Index terms: critical infrastructure security, healthcare cyber attacks, malware, botnet, command and control server, intrusion detection system, alert correlation.
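The idea of a correlation framework that suppresses false alarms from individual detectors can be sketched minimally: confirm an alert only when several distinct modules fire within a short time window. This is an illustrative sketch of the general technique, not BotDet's actual correlation logic; the parameter values and function name are assumptions.

```python
def correlate(alerts, min_modules=2, window=60.0):
    """alerts: (timestamp, module) pairs from individual detectors.
    Confirm an alert only when at least `min_modules` distinct modules
    fire within `window` seconds -- trading a lower false positive rate
    against some true positives, as the evaluation above quantifies."""
    alerts = sorted(alerts)
    confirmed = []
    for t, _ in alerts:
        modules = {m for s, m in alerts if t <= s < t + window}
        if len(modules) >= min_modules:
            confirmed.append(t)
    return confirmed

# A burst seen by two modules is confirmed; isolated alerts are dropped.
print(correlate([(0.0, "rogue_dns"), (10.0, "tor_check"), (200.0, "rogue_dns")]))
```

Raising `min_modules` or shrinking `window` makes the correlator stricter, which is exactly the true-positive/false-positive trade-off the abstract describes.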
Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), 2018
Native Language Identification (NLI) is the task of identifying an author's native language from their writings in a second language. In this paper, we introduce a new corpus (italki), which is larger than the current corpora. It can be used for training machine learning based systems for classifying and identifying the native language of authors of English text. To examine the usefulness of italki, we evaluate it by using it to train and test some of the well-performing NLI systems presented in the 2017 NLI shared task. We present some aspects of italki and show the impact of varying the size of italki's training dataset for some languages on system performance. From our empirical findings, we highlight the potential of italki as a large-scale corpus for training machine learning classifiers to identify the native language of authors from their written English text. We obtained promising results that show the potential of italki to improve the performance of current NLI systems. More importantly, we found that training the current NLI systems on italki generalizes better than training them on the current corpora.
In the twenty-first century, globalisation has made corporate boundaries invisible and difficult to manage. This new macroeconomic transformation caused by globalisation has introduced new challenges for critical infrastructure management. By replacing manual tasks with automated decision making and sophisticated technology, no doubt we feel much more secure than half a century ago. As technological advancement takes root, so does the maturity of security threats. It is common that today's critical infrastructures are operated by non-computer experts, e.g. nurses in health care, soldiers in the military or firefighters in emergency services. In such challenging applications, protecting against insider attacks is often neither feasible nor economically possible, but these threats can be managed using suitable risk management strategies. Security technologies, e.g. firewalls, help protect data assets and computer systems against unauthorised entry. However, one area which is often largely ignored is the human factor of system security. Through social engineering techniques, malicious attackers are able to breach organisational security via people interactions. This paper presents a security awareness training framework, which can be used to train operators of critical infrastructure on various social engineering security threats such as spear phishing, baiting and pretexting, among others.
IEEE International Conference on Big Data, Seattle, 2018
Natural Language Inference (NLI) is a fundamental step towards natural language understanding. The task aims to detect whether a premise entails or contradicts a given hypothesis. NLI contributes to a wide range of natural language understanding applications such as question answering, text summarization and information extraction. Recently, the public availability of big datasets such as Stanford Natural Language Inference (SNLI) and SciTail has made it feasible to train complex neural NLI models. In particular, Bidirectional Long Short-Term Memory networks (BiLSTMs) with attention mechanisms have shown promising performance for NLI. In this paper, we propose a Combined Attention Model (CAM) for NLI. CAM combines two attention mechanisms: intra-attention and inter-attention. The model first captures the semantics of the individual input premise and hypothesis with intra-attention and then aligns the premise and hypothesis with inter-sentence attention. We evaluate CAM on two benchmark datasets, Stanford Natural Language Inference (SNLI) and SciTail, achieving 86.14% accuracy on SNLI and 77.23% on SciTail. Further, to investigate the effectiveness of each attention mechanism individually and in combination, we present an analysis showing that the intra- and inter-attention mechanisms achieve higher accuracy when combined than when used independently.
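The two attention mechanisms combined in CAM can be illustrated with plain dot-product attention over token vectors. This NumPy sketch shows the general intra-/inter-attention pattern only; the real model's scoring functions and parameters are not reproduced here, and all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def inter_attention(premise, hypothesis):
    """Dot-product inter-attention: soft-align every premise token with
    the hypothesis tokens and return the attended hypothesis summaries."""
    scores = premise @ hypothesis.T               # (p, h) alignment scores
    return softmax(scores, axis=1) @ hypothesis   # (p, d) aligned vectors

def intra_attention(sentence):
    """Intra-attention: the same alignment applied within one sentence,
    capturing each token's context before cross-sentence alignment."""
    return inter_attention(sentence, sentence)
```

In the combined setting, intra-attention would first enrich each sentence's token vectors, and inter-attention would then align the enriched premise against the enriched hypothesis.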
This project focuses on finding the possibilities of using current advanced cryptosystems and dongles to improve the security of stored backup files on allocated remote servers. Secret key and public key cryptography are discussed. A number of advanced cryptosystems are identified, and their advantages and disadvantages are highlighted. A number of issues with using dongles to protect file security are identified, and the strengths of the chosen dongle for the prototype system are highlighted. A product prototype is developed to assess the contribution of cryptosystems and dongles to improving file security. The findings from the literature review show that cryptosystems can be used to improve file security. However, the findings indicate that dongles may not be a viable solution to protect files from unauthorised access due to general issues associated with dongle-protected software. The results from the prototype system indicate that using cryptosystems is beneficial to improving fil...
Content Management Systems are used to enable content authors to publish or update information on an organization's website without the need for web programming skills or the help of a technical person. During a one-year placement with the Prescription Pricing Division, it was noticed that the main responsibility of web developers was to publish or update information on the organization's website and intranet; contents are created by department administrators and forwarded to the web development team to publish. The process of publishing and updating information onto the organisation's intranet in that way proved time-consuming for both web developers and content authors; therefore a requirement was identified to develop a prototype intranet to empower content authors to publish information onto the organisation's intranet without the need for web programming skills or the help of a technical person. In this project, the author investigates various issues related to Content Management Syst...
In this paper, we present an approach to extracting different types of constraint rules from a dependency treebank. We also show how to integrate these constraint rules into a data-driven dependency parser, where the constraint rules inform parsing decisions in specific situations where a set of parsing rules (induced from a classifier) may offer several recommendations to the parser. Our experiments show that parsing accuracy can be improved by using different sets of constraint rules in combination with a set of parsing rules. Our parser is based on the arc-standard algorithm of MaltParser, with a number of extensions, which we discuss in some detail.
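The interaction described above between classifier recommendations and constraint rules can be sketched as a simple filtering step. This is a hypothetical sketch of the general pattern, not the paper's parser: the state representation, the example rule, and the fallback behaviour are all assumptions.

```python
def choose_transition(ranked, state, constraint_rules):
    """ranked: transitions recommended by the classifier, best first.
    constraint_rules: predicates returning True when a transition is
    disallowed in the current parser state. Return the highest-ranked
    transition no rule forbids; if every option is ruled out, fall back
    to the classifier's top choice."""
    for transition in ranked:
        if not any(rule(state, transition) for rule in constraint_rules):
            return transition
    return ranked[0]

# Hypothetical treebank-derived rule: never LEFT-ARC when the artificial
# root (index 0) is on top of the stack.
def no_left_arc_on_root(state, transition):
    return transition == "LEFT-ARC" and state["stack"][-1] == 0

state = {"stack": [0]}
print(choose_transition(["LEFT-ARC", "SHIFT"], state, [no_left_arc_on_root]))
```

Here the constraint rule overrides the classifier's top recommendation only in the specific state it targets, leaving the data-driven decision untouched elsewhere.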