
    Sean Blanchard

    In this work we present SaNSA, the Supercomputer and Node State Architecture, a software infrastructure for historical analysis and anomaly detection. SaNSA consumes data from multiple sources including system logs, the resource manager, scheduler, and job logs. Furthermore, additional context such as scheduled maintenance events or dedicated application run times for specific science teams can be overlaid. We discuss how this contextual information allows for more nuanced analysis. SaNSA allows the user to apply arbitrary attributes, for instance, positional information about where nodes are located in a data center. We show how this information is used to identify anomalous behavior in one rack of a 1,500-node cluster. We explain the design of SaNSA and then test it on four open compute clusters at LANL. We ingest over 1.1 billion lines of system logs in our study of 190 days in 2018. Using SaNSA, we perform a number of different anomaly detection methods and explain their findings in the context of a production supercomputing data center. For example, we report on instances of misconfigured nodes which receive no scheduled jobs for a period of time, as well as examples of correlated rack failures which cause jobs to crash.
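    The abstract above does not give implementation details, so the sketch below is only a rough illustration of the kind of per-node state bookkeeping a system like SaNSA performs. Field names, states, and data are hypothetical, not the actual SaNSA schema: state intervals carry arbitrary attributes (here, rack position), and idle time is tallied per rack to surface a rack that receives no scheduled jobs.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class StateInterval:
    node: str
    state: str          # e.g. "scheduled", "idle", "down", "maintenance"
    start: float        # epoch seconds
    end: float
    attrs: dict = field(default_factory=dict)   # arbitrary attributes, e.g. {"rack": "R07"}

def idle_hours_by_rack(intervals):
    """Aggregate time spent in the 'idle' state per rack.

    Racks that accumulate far more idle time than their peers are candidates
    for the misconfigured-node anomaly described in the abstract.
    """
    totals = defaultdict(float)
    for iv in intervals:
        if iv.state == "idle":
            totals[iv.attrs.get("rack", "unknown")] += (iv.end - iv.start) / 3600.0
    return dict(totals)

# Toy data: rack R07 receives no scheduled jobs for a full day.
intervals = [
    StateInterval("n0101", "scheduled", 0, 86_400, {"rack": "R01"}),
    StateInterval("n0701", "idle",      0, 86_400, {"rack": "R07"}),
    StateInterval("n0702", "idle",      0, 86_400, {"rack": "R07"}),
]
print(idle_hours_by_rack(intervals))   # -> {'R07': 48.0}
```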
    As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature sizes and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers today have to deal with these faults only to a small degree, but it is expected that this will become a larger problem as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using the open source virtual machine and processor emulator QEMU, we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors into the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.
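    SEFI itself operates inside QEMU on emulated machine instructions; the fragment below is only a language-level illustration of the same basic idea of logic soft-error injection, not the QEMU integration: intercept an arithmetic operation and, with some probability, flip one randomly chosen bit of its result.

```python
import random

def faulty_add(a: int, b: int, width: int = 64, p_flip: float = 1e-6) -> int:
    """Emulate an integer add whose result may suffer a single-bit soft error.

    With probability p_flip, one randomly chosen bit of the width-bit result
    is inverted, mimicking a transient logic error in the ALU.
    """
    result = (a + b) & ((1 << width) - 1)
    if random.random() < p_flip:
        result ^= 1 << random.randrange(width)
    return result

# Force an injection to see the effect (p_flip=1.0 guarantees a bit flip).
random.seed(0)
print(faulty_add(2, 3, p_flip=1.0))   # a value differing from 5 in exactly one bit
```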
    Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is crucial for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. For different types of failures, the data display different correlations with the performance metrics. In this paper, we present a wavelet-based multi-scale anomaly identification mechanism that can analyze profiled cloud performance metrics in both the time and frequency domains and identify anomalous cloud behaviors. Learning technologies are exploited to adapt the selection of mother wavelets, and a sliding detection window is employed to handle cloud dynamicity and improve anomaly detection accuracy. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment. Experimental results show the proposed mechanism can achieve 93.3% detection sensitivity while keeping the false positive rate as low as 6.1%, outperforming the other anomaly detection schemes tested.
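    The paper's mechanism adapts its mother wavelets and detection window; the numpy-only sketch below conveys just the multi-scale idea with a fixed Haar decomposition: compute detail-coefficient energy over sliding windows of a performance metric and flag windows whose energy deviates strongly from the baseline. Window size, threshold, and data are placeholders.

```python
import numpy as np

def haar_detail_energy(window: np.ndarray, levels: int = 3) -> float:
    """Sum of squared Haar detail coefficients across `levels` scales."""
    x = window.astype(float)
    energy = 0.0
    for _ in range(levels):
        if len(x) < 2:
            break
        pairs = x[: len(x) // 2 * 2].reshape(-1, 2)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
        energy += float(np.sum(detail ** 2))
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)   # approximation for next scale
    return energy

def flag_anomalies(metric: np.ndarray, win: int = 64, z_thresh: float = 3.0):
    """Slide a window over a performance metric and flag high-energy windows."""
    energies = np.array([haar_detail_energy(metric[i:i + win])
                         for i in range(0, len(metric) - win, win)])
    mu, sigma = energies.mean(), energies.std() + 1e-12
    return np.where((energies - mu) / sigma > z_thresh)[0]

rng = np.random.default_rng(1)
cpu_util = rng.normal(40, 2, 4096)
cpu_util[2048:2112] += rng.normal(0, 25, 64)   # injected anomalous burst
print(flag_anomalies(cpu_util))                # window index covering the burst
```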
    We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory. GPUs have a very good theoretical FLOPS rate and are reasonably inexpensive and available, but they are relatively new to HPC, which demands both consistently high performance across nodes and a consistently low error rate.
    Fault tolerance poses a major challenge for future large-scale systems. Current research on fault tolerance has been principally focused on mitigating the impact of uncorrectable errors: errors that corrupt the state of the machine and require a restart from a known good state. However, correctable errors occur much more frequently than uncorrectable errors and may be even more common on future systems. Although an application can safely continue to execute when correctable errors occur, recovery from a correctable error requires the error to be corrected and, in most cases, information about its occurrence to be logged. The potential performance impact of these recovery activities has not been extensively studied in HPC. In this paper, we use simulation to examine the relationship between recovery from correctable errors and application performance for several important extreme-scale workloads. Our paper contains what is, to the best of our knowledge, the first detailed analysis of the impact of correctable errors on application performance. Our study shows that correctable errors can have significant impact on application performance for future systems. We also find that although current work on correctable errors focuses on reducing failure rates, reducing the time required to log individual errors may have a greater impact on overheads at scale. Finally, this study outlines the error frequency and duration targets needed to keep correctable-error overheads similar to those of today's systems. This paper provides critical analysis of and insight into the overheads of correctable errors and provides practical advice to system administrators and hardware designers in an effort to fine-tune performance to application and system characteristics.
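    The paper relies on detailed simulation of extreme-scale workloads; the snippet below is only a back-of-the-envelope sketch of the underlying overhead question: if each correctable error stalls a node for a fixed correction-plus-logging time and errors arrive as a Poisson process, what fraction of job time is lost? The rates and costs are placeholders, not values from the study.

```python
import numpy as np

def correctable_error_overhead(error_rate_per_hour: float,
                               seconds_per_error: float,
                               job_hours: float,
                               trials: int = 10_000,
                               seed: int = 0) -> float:
    """Mean fraction of job time lost to correctable-error handling."""
    rng = np.random.default_rng(seed)
    n_errors = rng.poisson(error_rate_per_hour * job_hours, size=trials)
    lost_seconds = n_errors * seconds_per_error
    return float(np.mean(lost_seconds / (job_hours * 3600.0)))

# Placeholder numbers: 100 correctable errors/hour, 50 ms to correct and log each.
print(f"overhead ~ {correctable_error_overhead(100, 0.05, 24):.4%}")
```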
    System logs provide invaluable resources for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interaction among system components make traditional manual problem diagnosis and even automated line-by-line log analysis infeasible or ineffective. In this paper, we present a System Log Event Block Detection (SLEBD) framework that identifies groups of log messages which follow a certain sequence, with variations, and explores these event blocks for event-based system behavior analysis and anomaly detection. Compared with existing approaches that analyze system logs line by line, SLEBD is capable of characterizing system behavior and identifying intricate anomalies at a higher (i.e., event) level. We evaluate the performance of SLEBD using syslogs collected from production supercomputers. Experimental results show that our framework and mechanisms can process streaming log messages, efficiently extract event blocks, and effectively detect anomalies, which enables system administrators and monitoring tools to understand and process system events in real time. Additionally, we use the identified event blocks and explore deep learning algorithms to model and classify event sequences.
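    SLEBD's block-detection algorithm is considerably more involved than this; the sketch below only illustrates the general event-block idea: collapse each syslog line to a template, cut a node's message stream into blocks at large time gaps, and flag block signatures that occur rarely. The template extraction, thresholds, and data here are simplistic placeholders.

```python
import re
from collections import Counter

def template(msg: str) -> str:
    """Crude log-template extraction: mask hex tokens and numbers."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<*>", msg)

def event_blocks(lines, gap_seconds=5.0):
    """Split one node's (timestamp, message) stream into blocks at time gaps."""
    blocks, current, last_ts = [], [], None
    for ts, msg in lines:
        if last_ts is not None and ts - last_ts > gap_seconds and current:
            blocks.append(tuple(current))
            current = []
        current.append(template(msg))
        last_ts = ts
    if current:
        blocks.append(tuple(current))
    return blocks

def rare_blocks(blocks, min_support=2):
    counts = Counter(blocks)
    return [b for b, c in counts.items() if c < min_support]

stream = [(0.0, "eth0 link up"), (0.1, "mount /scratch ok"),
          (60.0, "eth0 link up"), (60.1, "mount /scratch ok"),
          (120.0, "MCE bank 4: corrected error at 0x1f2e")]
print(rare_blocks(event_blocks(stream)))   # the lone MCE block is flagged as rare
```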
    System logs provide invaluable resources for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interaction among system components make traditional manual problem diagnosis and even automated line-by-line log analysis infeasible or ineffective. Sequence mining technologies aim to identify important patterns among a set of objects, which can help us discover regularity among events, detect anomalies, and predict events in HPC environments. Existing sequence mining algorithms are compute-intensive and inefficient at processing the overwhelming number of system events, which have complex interactions and dependencies. In this paper, we present a novel, topology-aware sequence mining method (named TSM) and explore it for event analysis and anomaly detection on production HPC systems. TSM is resource-efficient and capable of producing long and complex event patterns from log messages, which makes TSM suitable for online monitoring and diagnosis of large-scale systems. We evaluate the performance of TSM using system logs collected from a production supercomputer. Experimental results show that TSM is highly efficient at identifying event sequences on single and multiple nodes without any prior knowledge. We apply verification functions and requirements to prove the correctness of the event patterns produced by TSM.
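    TSM is topology-aware and far more sophisticated than this; the sketch below only conveys the flavor of frequent-sequence mining over an event stream: count ordered k-event patterns that occur within a short time window and keep those above a support threshold. All parameters and event names are illustrative.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(events, k=2, window=10.0, min_support=3):
    """Mine frequent ordered k-event patterns from a (timestamp, event_id) stream.

    For every event, look ahead within `window` seconds and count each ordered
    k-subsequence starting at it; keep patterns seen at least `min_support` times.
    """
    counts = Counter()
    for i, (t0, e0) in enumerate(events):
        ahead = [e for t, e in events[i + 1:] if t - t0 <= window]
        for combo in combinations(ahead, k - 1):   # combinations preserve order
            counts[(e0,) + combo] += 1
    return {seq: c for seq, c in counts.items() if c >= min_support}

# Toy stream: "link_down" is regularly followed by "link_up" then "nfs_timeout".
events = []
for base in (0, 30, 60, 90):
    events += [(base + 0.0, "link_down"), (base + 1.0, "link_up"), (base + 2.0, "nfs_timeout")]
print(frequent_sequences(events, k=3))   # {('link_down', 'link_up', 'nfs_timeout'): 4}
```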
    Soft errors are becoming an important issue in computing systems. Near-threshold voltage (NTV), reduced circuit sizes, high performance computing (HPC), and high altitude computing all present interesting challenges in this area. Much of the existing literature has focused on hardware techniques to mitigate and measure soft errors at the hardware level. Instead, in this paper we explore the soft error susceptibility of three common sorting algorithms at the software layer. We focus on the comparison operator and use our software fault injection tool to place faults with fine precision during the execution of these algorithms. We explore how the algorithms' susceptibilities vary based on input and bit position, and relate these faults back to the source code to study how algorithmic decisions impact the reliability of the codes. Finally, we look at the question of the number of fault injections required for statistical significance. Using the standard equations applied in hardware fault injection experiments, we calculate the number of injections that should be required to achieve confidence in our results. Then we show, empirically, that more fault injections are required before we gain confidence in our experiments.
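    The paper's tool injects faults at the machine level with fine timing precision; as a purely illustrative source-level analogue, the sketch below flips the outcome of one chosen comparison during an insertion sort and reports whether the output is still correct, which is one crude way of probing a sorting algorithm's sensitivity to a single corrupted comparison.

```python
def insertion_sort_with_fault(data, faulty_cmp_index=None):
    """Insertion sort in which comparison number `faulty_cmp_index` is inverted."""
    a, cmp_count = list(data), 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0:
            result = a[j] > key
            if cmp_count == faulty_cmp_index:
                result = not result          # injected soft error in the comparator
            cmp_count += 1
            if not result:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a, cmp_count

data = [5, 2, 9, 1, 7, 3]
clean, total_cmps = insertion_sort_with_fault(data)
for idx in range(total_cmps):
    out, _ = insertion_sort_with_fault(data, faulty_cmp_index=idx)
    status = "ok" if out == clean else "corrupted"
    print(f"fault at comparison {idx}: {status}  {out}")
```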
    Environmental sensors monitor supercomputing facility health, generating massive amounts of data in the largest facilities. The current state of the art is for human operators to evaluate environmental data by hand. This approach will not be viable on exascale machines, nor is it ideal on current systems. We evaluate the effectiveness of the DBSCAN algorithm for identifying anomalies in supercomputing sensor data. We filter large portions of data showing normal behavior from anomalies, and then rank anomalous points by distance to the nearest normal cluster. We compare DBSCAN to k-means and Gaussian kernel density estimation, finding that DBSCAN effectively clusters sensor data from a Cray supercomputing facility. DBSCAN also successfully clusters synthetic injected data, avoiding the false positives generated by k-means and Gaussian kernel density estimation.
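    A minimal sketch of the clustering step, assuming scikit-learn is available: cluster two synthetic sensor channels with DBSCAN, treat the noise label (-1) as anomalous, and rank noise points by distance to the nearest clustered point. The epsilon, min_samples, and synthetic data below are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal = rng.normal(loc=[18.0, 45.0], scale=[0.5, 2.0], size=(500, 2))   # temp, humidity
anomalies = np.array([[25.0, 45.0], [18.0, 80.0]])                        # injected outliers
X = np.vstack([normal, anomalies])

labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)
noise = X[labels == -1]
clustered = X[labels != -1]

# Rank anomalous points by how far they sit from the nearest "normal" point.
nn = NearestNeighbors(n_neighbors=1).fit(clustered)
dist, _ = nn.kneighbors(noise)
for point, d in sorted(zip(noise.tolist(), dist.ravel()), key=lambda t: -t[1]):
    print(f"anomaly {point}: distance to normal cluster {d:.2f}")
```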
    Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that are capable of guaranteeing the detection and even correction of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the output of our analytical models is within 3.9% of Monte Carlo simulations, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into the dependence of the reliability of these systems on scale, device fault rate, memory organization, and memory-scrubbing policy.
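    The paper's analytical models are far more detailed; a toy version of the validation idea, under a simple independence assumption, is to compare a binomial expression for "three or more of the n DRAM devices in a rank fault within one interval" (which double-chipkill correct cannot fix) against a Monte Carlo estimate. Device count and fault probability below are placeholders.

```python
import math
import random

def p_uncorrectable_analytical(n_devices: int, p_fault: float) -> float:
    """P(>= 3 faulty devices) under independent per-device fault probability."""
    p_le2 = sum(math.comb(n_devices, k) * p_fault**k * (1 - p_fault)**(n_devices - k)
                for k in range(3))
    return 1.0 - p_le2

def p_uncorrectable_monte_carlo(n_devices: int, p_fault: float,
                                trials: int = 500_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    bad = sum(1 for _ in range(trials)
              if sum(rng.random() < p_fault for _ in range(n_devices)) >= 3)
    return bad / trials

n, p = 18, 1e-2       # e.g. 18 devices per rank, placeholder per-interval fault probability
print(f"analytical  {p_uncorrectable_analytical(n, p):.3e}")
print(f"monte-carlo {p_uncorrectable_monte_carlo(n, p):.3e}")
```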
    As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. HPC applications of today are only affected by soft errors to a small degree, but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application, which sub-function, and when and how to inject soft errors with different granularities, without interference to other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers, or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
    Future exascale application programmers and users need to quantify an application's resilience and vulnerability to soft errors before running their codes on production supercomputers, due to the cost of failures and the hazards posed by silent data corruption. Barring a deep understanding of the resiliency of a particular application, vulnerability evaluation is commonly done through fault injection tools at either the software or hardware level. Hardware fault injection, while most realistic, is limited to customized vendor chips, and applications usually cannot be evaluated at scale. Software fault injection can be done more practically and efficiently and is the approach that many researchers use as a reasonable approximation. With a sufficiently sophisticated software fault injection framework, an application can be studied to see how it would handle many of the errors that manifest at the application level. Using such a tool, a developer can progressively improve resilience at targeted locations they believe are important for their target hardware.
    That cosmic rays cause faults in computers is well established within the reliability community. However, during Solar Proton Events (SPEs), the Earth's magnetic field compresses, which can shield the Earth from the effects of cosmic rays. In this paper, we use statistical analysis to quantitatively assess whether differing flux levels from SPEs lead to significant changes in the number of faults observed on Cielo, a supercomputer at Los Alamos National Laboratory. From our analysis, we found that as flux levels increase during SPEs, there is an overall decrease in the number of faults on Cielo. A better understanding of how SPEs affect fault rates allows the high performance computing reliability community to more accurately compare cosmic-ray-induced faults from different time periods.
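    The paper's statistical analysis is its own; the sketch below merely illustrates one way such a comparison could be framed, assuming daily fault counts tagged as SPE (high proton flux) or quiet days, using a one-sided Mann-Whitney U test to ask whether SPE days tend to have fewer faults. The data here are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
quiet_day_faults = rng.poisson(lam=12.0, size=150)   # synthetic baseline daily fault counts
spe_day_faults   = rng.poisson(lam=9.0,  size=20)    # synthetic counts during SPE days

# One-sided test: are fault counts on SPE days stochastically smaller?
stat, p_value = mannwhitneyu(spe_day_faults, quiet_day_faults, alternative="less")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```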
    Monitoring high performance computing systems has become increasingly difficult as researchers and system analysts face the challenge of synthesizing a wide range of monitoring information in order to detect system problems on ever larger machines. We present a method for anomaly detection on syslog data, one of the most important data streams for determining system health. Syslog messages pose a difficult analysis problem because they include a mix of structured natural language text as well as numeric values. We present an anomaly detection framework that combines graph analysis, relational learning, and kernel density estimation to detect unusual syslog messages. We design an event block detector, which finds groups of related syslog messages, to retrieve the entire section of syslog messages associated with a single anomalous line. Our novel approach successfully retrieves anomalous behaviors inserted into syslog files from a virtual machine, including messages indicating serious system problems. We also test our approach on syslog messages from the Trinity supercomputer and find that our methods do not generate significant false positives.
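    The full pipeline combines graph analysis, relational learning, and kernel density estimation; the fragment below sketches only the density-estimation step, assuming each syslog line has already been reduced to a small numeric feature vector (the features here are hypothetical), and flags lines whose estimated density falls below a low percentile of the normal traffic.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Hypothetical per-message features: [message length, rare-token count, numeric-token count]
normal = rng.normal([60, 1, 4], [10, 0.5, 1.5], size=(2000, 3))
weird  = np.array([[220, 9, 30]])          # an injected, unusual message
features = np.vstack([normal, weird])

kde = gaussian_kde(normal.T)               # fit on (presumed) normal traffic
log_dens = np.log(kde(features.T) + 1e-300)
threshold = np.percentile(log_dens[:-1], 0.5)   # bottom 0.5% of normal density

flagged = np.where(log_dens < threshold)[0]
print("flagged message indices:", flagged)      # includes the injected message (index 2000)
```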
    The numerical mode-matching (NMM) method is an efficient algorithm that has previously been used to model various multi-region vertically and cylindrically stratified inhomogeneous media. It has been shown that the NMM method is more efficient than direct use of the finite element method (FEM) to solve these problems. However, the applications of the NMM method have been limited to two-dimensional (2-D)
    The high performance, high efficiency, and low cost of Commercial Off-The-Shelf (COTS) devices make them attractive even for applications with strict reliability constraints. Today, COTS devices are adopted in HPC and in safety-critical applications such as autonomous driving. Unfortunately, we cannot assume that the COTS chip manufacturing process excludes cheap natural boron, which can make these devices highly susceptible to thermal (low-energy) neutrons. Through radiation beam experiments using high-energy and low-energy neutrons, it has been shown that thermal neutrons are a significant threat to COTS device reliability. The evaluation includes an AMD APU, three different NVIDIA GPUs, an Intel accelerator, and an FPGA executing a relevant set of algorithms. Besides the sensitivity of the devices to thermal neutrons, it is also fundamental to consider the thermal neutron flux in different scenarios, such as weather, concrete walls and floors, or even HPC liquid cooling systems. Correlating beam experiments and neutron detector data, it is shown that the thermal neutron FIT rate can be comparable to or even higher than the high-energy neutron FIT rate.
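    The FIT numbers in the paper come from its own beam experiments and detector data; the arithmetic for turning such measurements into a FIT rate is standard and sketched below with placeholder values only: cross section = observed errors / particle fluence, and FIT = cross section x ambient flux x 10^9 hours. The flux values here are assumptions for illustration, not the paper's measurements.

```python
def fit_rate(errors_observed: int, beam_fluence_n_per_cm2: float,
             ambient_flux_n_per_cm2_h: float) -> float:
    """Failures in time (failures per 1e9 device-hours) from beam-test counts."""
    cross_section_cm2 = errors_observed / beam_fluence_n_per_cm2   # errors per (n/cm^2)
    return cross_section_cm2 * ambient_flux_n_per_cm2_h * 1e9

# All numbers below are placeholders, not measurements from the paper.
print(f"high-energy FIT ~ {fit_rate(120, 1e11, 13.0):.1f}")   # assumed sea-level high-energy flux
print(f"thermal FIT     ~ {fit_rate(90, 5e10, 6.0):.1f}")     # thermal flux varies with environment
```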
    As the scale of high performance computing facilities approaches the exascale era, gaining a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors which were statistically negligible at smaller scales will become more prevalent. In order to understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Due to the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors and show that relatively simple statistical models are effective at predicting future errors based on historical data, allowing proactive error mitigation.
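    The study compares six algorithms on real field data; the sketch below, using synthetic stand-in data, only shows the shape of such an experiment with scikit-learn: train a classifier whose inputs include spatial features (rank, bank, neighboring-row error history) on data from one machine and evaluate ROC AUC on data standing in for another machine. Feature names and data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def synthetic_dram_data(n_rows: int):
    """Stand-in features: [rank, bank, row/1e4, errors_in_neighbor_rows, prior_errors_on_dimm]."""
    X = np.column_stack([
        rng.integers(0, 4, n_rows), rng.integers(0, 16, n_rows),
        rng.random(n_rows), rng.poisson(0.2, n_rows), rng.poisson(0.5, n_rows),
    ]).astype(float)
    # Spatially correlated ground truth: neighbor-row and DIMM error history dominate.
    logits = 1.5 * X[:, 3] + 0.8 * X[:, 4] - 3.0
    y = rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logits))
    return X, y.astype(int)

X_train, y_train = synthetic_dram_data(20_000)   # "train an error model on one supercomputer..."
X_test, y_test = synthetic_dram_data(5_000)      # "...and predict errors on another"

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"cross-machine ROC AUC ~ {auc:.3f}")
```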
    With the ever-growing scaling of computing capability, computing systems like supercomputers and embedded systems are now bounded by limited power. Given the mutually constraining nature of power efficiency and resilience, trade-offs between them have been extensively studied in order to achieve the optimal performance-power ratio, either under a certain power cap or within the quality-metric requirements of applications. Theoretically, running programs in the low-power mode of computational components (e.g., CPU/GPU) can lead to increased on-chip failure rates in terms of register-level susceptibility to soft errors. However, experimentally, such errors may not arise due to register vulnerability: errors that occur during non-vulnerable register access intervals are invalidated and thus will not propagate to later execution. In this work, leveraging register vulnerability, we investigate the validity of failure rates in computing systems at Near-Threshold Voltage (NTV), and empirically e...
    Supercomputers, high performance computers, and clusters are composed of very large numbers of independent operating systems, each generating its own system logs. Messages are generated locally on each host and are usually transferred to a central logging infrastructure, which keeps a master record of the system as a whole. At Los Alamos National Laboratory (LANL), a collection of open source cloud tools is used to log over a hundred million system log messages per day from over a dozen such systems. Understanding what source code created those messages can be extremely useful to system administrators when they are troubleshooting these complex systems, as it can give insight into a subsystem (disk, network, etc.) or even specific lines of source code. Oftentimes, debugging supercomputers is done in environments where open access cannot be provided to all individuals due to security concerns. As such, providing a means for relating system log messages to source code lines allows for communication between system administrators and source developers or supercomputer vendors. In this work, we demonstrate a prototype tool which aims to provide such an expert system. We leverage capabilities from ElasticSearch, one of the open source cloud tools deployed at LANL, and with our own metrics develop a means for correctly matching source code lines, as well as files, with high confidence. We discuss confidence metrics and show that in our experiments 92% of syslog lines were correctly matched. For any future samples, we predict with 95% confidence that the correct file will be detected between 88.2% and 95.8% of the time. Finally, we discuss enhancements that are underway to improve the tool and study it on a larger dataset.
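    The quoted 95% confidence band (88.2% to 95.8% around the 92% match rate) has the shape of a standard binomial interval; the sketch below shows the normal-approximation calculation with an assumed sample size of 200 (the abstract does not state the actual number of evaluated syslog lines), which happens to give a band close to the quoted one, so the numbers here are illustrative only.

```python
import math

def binomial_ci(p_hat: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed sample size, chosen only to show the shape of the calculation.
low, high = binomial_ci(0.92, n=200)
print(f"95% CI: {low:.1%} to {high:.1%}")
```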
    This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two-sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and the singular value decomposition. Algorithm-based fault tolerance has been shown to work on the three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two-sided factorizations is non-trivial because there is no obvious offline, problem-specific maintenance of checksums. We thus develop an online, algorithm-specific checksum scheme and show how to systematically adapt the two-sided factorization algorithms used in the LAPACK and ScaLAPACK packages to introduce the algorithm-based fault tolerance. The resulting ABFT scheme can detect and correct arithmetic errors continuously during the factorizations, allowing timely error handling. ...
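    The paper's contribution is the online checksum scheme for two-sided factorizations; the sketch below only demonstrates the classical Huang-Abraham-style checksum idea it builds on, in the simpler setting of a matrix product: encode row and column checksums, inject a single-element error, then locate and correct it from the checksum residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# Column-checksum A and row-checksum B: their product carries both checksums.
A_c = np.vstack([A, A.sum(axis=0)])                   # (n+1) x n
B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])    # n x (n+1)
C_f = A_c @ B_r                                       # (n+1) x (n+1) checksummed result

# Inject a single soft error into the data part of the result.
i_err, j_err, delta = 2, 4, 7.5
C_f[i_err, j_err] += delta

# Detection/location: the row and column whose checksums no longer match.
row_resid = C_f[:n, :n].sum(axis=1) - C_f[:n, n]
col_resid = C_f[:n, :n].sum(axis=0) - C_f[n, :n]
i_det, j_det = int(np.argmax(np.abs(row_resid))), int(np.argmax(np.abs(col_resid)))

# Correction: subtract the residual at the located element.
C_f[i_det, j_det] -= row_resid[i_det]
print((i_det, j_det) == (i_err, j_err), np.allclose(C_f[:n, :n], A @ B))   # True True
```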
