International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 10 Issue: 01 | Jan 2023
p-ISSN: 2395-0072
www.irjet.net
LEXICAL ANALYZER
Rushikesh Lakhotiya1, Mayuresh Chavan2, Satwik Divate3, Soham Pande4
1,2,3,4Student, Dept. of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune,
Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The process of turning a string of characters
into a string of tokens is known as lexical analysis,
commonly referred to as lexing or tokenization. These
tokens may be keywords, identifiers, constants, operators, or
other language-specific symbols.
The lexical analyzer normally functions independently and
uses only one or two subprocesses and global variables to
interact with the rest of the compiler. Every time the parser
requires a new token, it calls the lexical analyzer, which
then delivers both the token and the lexeme that goes with
it.
The word lexical is derived from lexeme, the sequence of
characters matched for a token.
The process of lexical analysis usually includes reading each
character of the input one by one, grouping characters into
tokens, and passing these tokens to a parser or other
program for further processing.
The lexical analyzer can be replaced or modified without
impacting the rest of the compiler, because the real input
is concealed from the parser.
Lexical analysis is often the first step in the operation of
compiling or interpreting a program. It is also used in
natural language processing, information retrieval, and
other fields where it is necessary to identify and classify the
elements of a body of text.
In general, lexical analysis involves breaking up a stream of
text into a sequence of tokens, which can then be further
processed and analyzed by other programs. It is an
important step in the compilation and interpretation of
programming languages, as well as in the processing of
natural language.
In this paper we study the role of lexical analysis in
the overall process of compiling or interpreting a program,
as well as techniques for defining the tokens that a lexical
analyzer should recognize, such as regular expressions,
along with an implementation.
Key Words: Lexical Analyzer, Lexeme, Compiler, Syntax
Analysis, Deterministic Finite Automata, Regular
Expression, Tokens, Parallel Tokenization, Multi-core
Machines.
1. INTRODUCTION

The lexical analyzer is the initial step in the compilation
procedure. This phase, also referred to as a lexical scanner,
scans the input string from left to right, reading each
symbol only once, without going back. The primary
responsibility of the lexical analyzer is to take input
characters and generate the sequence of tokens that the
parser uses for syntax analysis. When invoked by the
parser, the lexical analyzer reads input characters until it
can recognize the next token.

A lexical analyzer generates tokens from lexemes.
Internally, the tokens are frequently represented as
distinct integers or values of an ordered type. In order to
distinguish between the many possible name or numeric
tokens, the lexeme is necessary in addition to the token
itself.

2. LITERATURE REVIEW

Daniele Paolo Scarpazza et al. [1] suggested a parallel
regular-expression-based tokenization technique that
makes use of the substantial thread- and data-level
parallelism offered by multi-core architectures. It is
derived from the Deterministic Finite Automaton (DFA)
model and was designed for branch elimination and
SIMDization in prediction-like applications.

Umarani Srikanth [2] proposed an approach that divides
the source program into a predetermined number of
blocks using a dynamic block-splitter algorithm in order to
carry out lexical analysis concurrently. The Aho-Corasick
method tokenizes data by swiftly scanning large
dictionaries for a string on multi-core IBM Cell processors.
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 384

Swagat Kumar Jena et al. [3] According to the method,
each block of the source file is divided into M lines, with
the possible exception of the last block, and each block is
stored in memory as a separate file. Then N lexical
analysis threads are created, and lexical analysis is
carried out on the files simultaneously.
Wuu Yang et al. (2002) [9] identified the issue of the
longest-match rule's applicability and proposed a model.
The method consists of two steps: the first determines the
regular set of token patterns produced by a
nondeterministic finite automaton while the automaton
processes components of an input regular set, and the
second determines, via a set of equations, whether a
regular set and a context-free language have any
non-trivial intersection. To enable a parallel procedure
model in the C programming language, Russell et al. (1992)
created additions to the parallel procedural language and a
runtime environment. In order to reduce the need for
expensive process control blocks, a novel method of
nesting parallel process contexts in multiple stack frames
is used in the run-time framework, and performance data
for two parallel programs using the proposed system is
provided.
Amit Barve et al. [4] stated a method based on the open-source
automatic lexer generator Flex that exploits the
concept of processor affinity and partitions code written
in the C/C++ programming language at for-loop
boundaries.
Amit Barve et al. [5] described a method based on setting
pivot points, which divide the code into a predetermined
number of blocks equal to the number of available CPUs.
White-space characters, various topologies, and line-based
pivot elements were taken into consideration.
Amit Barve et al. [6] presented a modernized version of
their earlier approach for developing fast parallel lexers
for multi-core processors. The proposed algorithm stores
block indicators of the source code in a text file that is
later read. Based on the read indicators, processes are
branched and assigned to different CPUs using processor
affinity. The algorithm's efficiency is increased by
assigning operations to an available processor as soon as
a process is formed; this method eliminates the need to
wait for a process to be allocated to a processor.
Xiaoyan Lai (2014) [10] presented an innovative
implementation approach for lexical analysis, syntactic
analysis, and interpretative execution. The experimental
analysis demonstrates the integrity and reliability of the
compilation system. Amit Barve et al. (2012) presented a
new approach to implementing a lexical analyzer that runs
in parallel, based on the open-source automatic lexer
generator Flex and exploiting the concept of processor
affinity. It is measured to be a simple and faster process,
partitioning code written in the C/C++ programming
language at for-loop boundaries. The work reasonably
illustrates the benefit of multi-core machines in
accelerating lexical analysis tasks. Thomas Reps et al.
(1998) described the compilation domain, where the
tokenization process can always be carried out in time
linear in the input size, whereas the standard tokenization
algorithm can, in the worst case, exhibit quadratic
behavior for some sets of token definitions.
Amit Barve et al. (2015) [7] explained an enhanced version
of the parallel lexical analysis algorithm. The authors
assert that the memory-block-based approach exceeds the
results of their earlier work, which used a round-robin
CPU scheduling technique to run lexical analysis in
parallel; the highest speedup achieved is 6.84. The speedup
increases as the number of CPUs rises, which further
improves overall compilation time. Daniele Paolo
Scarpazza et al. (2007) investigated the efficiency of the
Cell processor system when it is used to implement
Deterministic Finite Automata-based string-matching
algorithms.
Daniele Paolo Scarpazza et al. (2008) [8] reported
experimental results indicating that the Cell is an ideal
candidate for managing security requirements. One Cell
processor has eight processing units, yet only two of them
are needed to process a network connection with a data
rate of more than 10 Gbps. Using the Aho-Corasick
string-searching algorithm, the authors developed
optimized string-matching solutions for the Cell processor;
the results showed a throughput of 40 Gbps per processor
when the dictionaries are tiny enough to fit into the local
memory of the processing cores, and between 1.6 and 2.2
Gbps per processor for bigger dictionaries.

3. METHODOLOGY

A) Proposed System

There are several different approaches to performing
lexical analysis, but a common methodology involves the
following steps:

i. Read each character of the input one at a time.
ii. Identify the next token in the input. This may involve
looking for patterns in the characters, such as sequences
of digits that form a number, or strings of letters that form
a keyword or identifier.
iii. Extract the token from the input.
iv. Classify the token based on its type (e.g., keyword,
identifier, constant, operator).
v. Pass the token to a parser or other program for further
processing.
4. RESULTS AND DISCUSSIONS
Some lexical analyzers also include additional steps, such
as removing comments or white space from the input, or
performing preprocessing on the input before
tokenization. It is also possible to use regular expressions
or finite automata to perform lexical analysis. Regular
expressions are a way of describing patterns in strings,
and can be used to identify and extract tokens from the
input. Finite automata are mathematical models that can
be used to recognize patterns in input and are often used
in the implementation of lexical analyzers.
Fig -3: Sample Input Program

In Fig. 3, a sample C-language program is passed to the
lexical analyzer.

Fig -4: Output

In Fig. 4, the sample C program given as input is converted
into tokens, i.e. keywords, identifiers, mathematical
operators, logical operators, numerical values, and other
separators.

B) Flowchart

Fig -1: Lexical Analyzer with the Parser
5. LIMITATIONS
- The presence of an illegal character, often at the start of
a token, results in a lexical error.
- Some regular expressions are quite challenging to
comprehend.
- The lexer and its token descriptions require more work
to build and test.
6. CONCLUSION

The goal of this study was to conduct a thorough
examination of recent research on lexical analyzer
implementation methodologies. The review shows that
various software tools for lexical analyzers developed in
the past are ideally suited to serial execution. With the rise
of multi-core systems, the different stages of the
compilation process must be updated to accommodate
multi-core architectures in order to attain parallelism in
compilation tasks and thereby minimize compilation time.
An extensive analysis provides a deeper insight into the
lexical analyzer.

Fig -2: Dividing into Tokens

Among the results
recorded in the reviewed articles, it is observed that one of
the high-level trends in scanner generation is the adoption
of parallel processing for lexical analysis tasks, using the
multi-core processor-affinity principle to increase the
compiler's runtime efficiency compared with the
sequential execution of lexical analysis on a
single-processor system.
7. FUTURE SCOPE
The development of computing power is moving rapidly
towards massive multi-core platform due to its power and
performance benefits. System software, including
compilers, should be designed for parallel processing in
order to take full use of multi-core technology.
Implementing more pattern matching algorithms into the
program.
REFERENCES

[1] Aho, A. V., Lam, M. S., & Sethi, R. (2009). Compilers: Principles, Techniques and Tools, 2nd ed. Pearson Education.
[2] Lesk, M. E., & Schmidt, E. (1975). Lex: A lexical analyzer generator. Computing Science Technical Report No. 39, Bell Laboratories, Murray Hill, New Jersey.
[3] Mickunas, M. D., & Schell, R. M. (1978, December). Parallel compilation in a multiprocessor environment. In Proceedings of the 1978 annual conference (pp. 241-246).
[4] Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33(1), 1-26.
[5] Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), 8-23.
[6] Glesner, S., Forster, S., & Jager, M. (2005). A program result checker for the lexical analysis of the GNU C compiler. Electronic Notes in Theoretical Computer Science, 132(1), 19-35.
[7] Barve, A., & Joshi, B. K. (2012, September). A parallel lexical analyzer for multi-core machines. In 2012 CSI Sixth International Conference on Software Engineering (CONSEG) (pp. 1-3). IEEE.
[8] Barve, A., & Joshi, B. K. (2013). Automatic C code generation for parallel compilation. International Journal on Advanced Computer Theory and Engineering (IJACTE), 2(4), 26-28.
[9] Omori, Y., Joe, K., & Fukuda, A. (1997, August). A parallelizing compiler by object-oriented design. In Proceedings Twenty-First Annual International Computer Software and Applications Conference (COMPSAC'97) (pp. 232-239). IEEE.
[10] Barve, A., & Joshi, B. K. (2016). Fast parallel lexical analysis on multi-core machines. International Journal of High-Performance Computing and Networking, 9(3), 250-257.
[11] Scarpazza, D. P., Villa, O., & Petrini, F. (2007, March). Peak-performance DFA-based string matching on the Cell processor. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1-8). IEEE.
[12] Jena, S. K., Das, S., & Sahoo, S. P. (2018). Design and development of a parallel lexical analyzer for C language. International Journal of Knowledge-Based Organizations (IJKBO), 8(1), 68-82.
[13] Clapp, R. M., & Mudge, T. N. (1992). Parallel language constructs for efficient parallel processing. University of Michigan, Computer Science and Engineering Division, Department of Electrical Engineering and Computer Science. IEEE, 230-241.
[14] Li, D. C., Cai, X. C., Han, C. Y., & Liu, Y. X. (2012). The research and analysis of lexical analyzer in Prolog compiler. In Applied Mechanics and Materials (Vol. 229, pp. 1733-1737). Trans Tech Publications Ltd.
[15] Maliavko, A. A. (2018, October). The lexical and syntactic analyzers of the translator for the EI language. In 2018 XIV International Scientific-Technical Conference on Actual Problems of Electronics Instrument Engineering (APEIE) (pp. 360-364). IEEE.
[16] Wang, X., Hong, Y., Chang, H., Park, K., Langdale, G., Hu, J., & Zhu, H. (2019). Hyperscan: A fast multi-pattern regex matcher for modern CPUs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19) (pp. 631-648).
[17] Becchi, M., & Crowley, P. (2013). A-DFA: A time- and space-efficient DFA compression algorithm for fast regular expression evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), 1-26.
[18] Aithal, P. S., & Pai T, V. (2016). Concept of ideal software and its realization scenarios. International Journal of Scientific Research and Modern Education (IJSRME), 1(1), 826-837.
[19] Ingale, V., Vayadande, K., Verma, V., Yeole, A., Zawar, S., & Jamadar, Z. Lexical analyzer using DFA. International Journal of Advance Research, Ideas and Innovations in Technology.
[20] Chandra, A., Bongulwar, A., Jadhav, A., Ahire, R., Dumbre, A., Ali, S., Kamble, A., Arole, R., Jiby, B., & Bhatti, S. (2022). Survey on randomly generating English sentences. No. 7655. EasyChair.
[21] Manjramkar, D., Gharpure, A., Gore, A., Gujarathi, I., & Deore, D. (2022). A review paper on document text search based on nondeterministic automata.
[22] Vayadande, K., Bhavar, N., Chauhan, S., Kulkarni, S., Thorat, A., & Annapure, Y. (2022). Spell checker model for string comparison in automata. No. 7375. EasyChair.
[23] Vayadande, K., Mandhana, R., Paralkar, K., Pawal, D., Deshpande, S., & Sonkusale, V. Pattern matching in file system. International Journal of Computer Applications, 975: 8887.
[24] Vayadande, K. (2022). Simulating derivations of context-free grammar.
[25] Vayadande, K. B., Sheth, P., Shelke, A., Patil, V., Shevate, S., & Sawakare, C. (2022). Simulation and testing of deterministic finite automata machine. International Journal of Computer Sciences and Engineering, 10(1), 13-17.