DOI: 10.1145/3373376.3378471

Why GPUs are Slow at Executing NFAs and How to Make them Faster

Published: 13 March 2020

Abstract

Non-deterministic Finite Automata (NFAs) are space-efficient finite state machines that have significant applications in domains such as pattern matching and data analytics. In this paper, we investigate why the Graphics Processing Unit (GPU), a massively parallel computational device with the highest memory bandwidth available on general-purpose processors, cannot efficiently execute NFAs. First, we identify excessive data movement in the GPU memory hierarchy and describe how to privatize reads effectively using the GPU's on-chip memory hierarchy to reduce this data movement. We also show that in several cases, indirect table lookups in NFAs can be eliminated by converting memory reads into computation, which further reduces the number of memory reads. Although our optimization techniques significantly alleviate these memory-related bottlenecks, a side effect of these techniques is the static assignment of work to cores. This leads to poor compute utilization, where GPU cores are wasted on idle NFA states. Therefore, we propose a new dynamic scheme that effectively balances compute utilization with reduced memory usage. Our combined optimizations provide a significant improvement over the previous state-of-the-art GPU implementations of NFAs. Moreover, they enable current GPUs to outperform the domain-specific accelerator for NFAs (i.e., the Automata Processor) across several applications while performing within an order of magnitude for the rest of the applications.
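
To make the memory-versus-computation idea in the abstract concrete, the following is a minimal CPU-side sketch in Python, not the paper's GPU implementation. It steps a toy NFA over an input string and decides whether a state accepts a symbol in two ways: by reading a per-state symbol table (a memory read), or by a range comparison (computation). The toy pattern, the data layout, and all names (`SYMBOLS`, `RANGES`, `SUCCESSORS`, `run`) are illustrative assumptions. The range form only applies when a state's symbol set happens to be contiguous.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy NFA for the unanchored pattern "a[b-d]+e".
# Each state accepts a set of symbols and lists its successor states.
SYMBOLS: Dict[int, FrozenSet[str]] = {
    0: frozenset("a"),
    1: frozenset("bcd"),
    2: frozenset("e"),
}
SUCCESSORS: Dict[int, List[int]] = {0: [1], 1: [1, 2], 2: []}
START_STATES: List[int] = [0]   # re-enabled at every position (unanchored search)
REPORT_STATES: Set[int] = {2}

# The same symbol sets expressed as inclusive ranges (assumption: every set
# in this toy NFA is contiguous, so a comparison can replace the lookup).
RANGES: Dict[int, Tuple[str, str]] = {0: ("a", "a"), 1: ("b", "d"), 2: ("e", "e")}


def matches_by_lookup(state: int, ch: str) -> bool:
    """Decide the match by reading the state's symbol table."""
    return ch in SYMBOLS[state]


def matches_by_compute(state: int, ch: str) -> bool:
    """Decide the match by comparing against the state's range bounds."""
    lo, hi = RANGES[state]
    return lo <= ch <= hi


def run(text: str, matches: Callable[[int, str], bool]) -> List[Tuple[int, int]]:
    """Return (position, state) pairs for every reporting state that fires."""
    enabled: Set[int] = set(START_STATES)
    reports: List[Tuple[int, int]] = []
    for pos, ch in enumerate(text):
        # Only enabled states are examined; all other states are idle this step.
        active = {s for s in enabled if matches(s, ch)}
        reports.extend((pos, s) for s in sorted(active) if s in REPORT_STATES)
        # Successors of active states, plus the start states, are enabled next.
        enabled = set(START_STATES)
        for s in active:
            enabled.update(SUCCESSORS[s])
    return reports


if __name__ == "__main__":
    print(run("xabcde", matches_by_lookup))
    print(run("xabcde", matches_by_compute))
```

Both matching functions produce the same reports on this toy input. The loop in `run` visits only the enabled states at each step, which loosely mirrors the abstract's observation that assigning every state to a core statically leaves cores idle when few states are enabled.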

Published In

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2020
1412 pages
ISBN: 9781450371025
DOI: 10.1145/3373376
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. finite state machine
  2. gpu
  3. parallel computing

Qualifiers

  • Research-article

Conference

ASPLOS '20

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Cited By

  • (2024) HybridSA: GPU Acceleration of Multi-pattern Regex Matching using Bit Parallelism. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (1699-1728). DOI: 10.1145/3689771. Online publication date: 8-Oct-2024
  • (2024) BVAP: Energy and Memory Efficient Automata Processing for Regular Expressions with Bounded Repetitions. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (151-166). DOI: 10.1145/3620665.3640412. Online publication date: 27-Apr-2024
  • (2024) ngAP: Non-blocking Large-scale Automata Processing on GPUs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (268-285). DOI: 10.1145/3617232.3624848. Online publication date: 27-Apr-2024
  • (2024) One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (193-206). DOI: 10.1109/CGO57630.2024.10444810. Online publication date: 2-Mar-2024
  • (2023) Asynchronous Automata Processing on GPUs. ACM SIGMETRICS Performance Evaluation Review 51:1 (23-24). DOI: 10.1145/3606376.3593524. Online publication date: 27-Jun-2023
  • (2023) Search-Based Regular Expression Inference on a GPU. Proceedings of the ACM on Programming Languages 7:PLDI (1317-1339). DOI: 10.1145/3591274. Online publication date: 6-Jun-2023
  • (2023) Covering All the Bases: Type-Based Verification of Test Input Generators. Proceedings of the ACM on Programming Languages 7:PLDI (1244-1267). DOI: 10.1145/3591271. Online publication date: 6-Jun-2023
  • (2023) An Automata-Based Framework for Verification and Bug Hunting in Quantum Circuits. Proceedings of the ACM on Programming Languages 7:PLDI (1218-1243). DOI: 10.1145/3591270. Online publication date: 6-Jun-2023
  • (2023) ImageEye: Batch Image Processing using Program Synthesis. Proceedings of the ACM on Programming Languages 7:PLDI (686-711). DOI: 10.1145/3591248. Online publication date: 6-Jun-2023
  • (2023) CQS: A Formally-Verified Framework for Fair and Abortable Synchronization. Proceedings of the ACM on Programming Languages 7:PLDI (244-266). DOI: 10.1145/3591230. Online publication date: 6-Jun-2023
