Abstract
Tasks involving binary data formats include parsing, generation, and joint analysis of code and data. All of these tasks rely on a universal data format model. This paper proposes an approach to modeling binary data formats. The model is expressive enough to specify most widespread data formats. Its distinctive features are flexibility in specifying field locations and the ability to describe external fields whose structure cannot be determined by parsing. The implemented infrastructure allows the representation to be created and modified through application programming interfaces. An algorithm based on the concept of field computability is proposed for parsing binary data according to a specified model. A domain-specific language for data format specification is also described. The formats that have been specified and potential practical applications of the model for programmatic analysis of formatted data are discussed.
Notes
Since different languages have different primitives, partial format descriptions with a similar structure were used.
Additional information
Translated by A. Klimontovich
Data parsing algorithm in accordance with the proposed model (pseudocode)
INPUT: pointer to data (format instance and starting address)
OUTPUT: the result structure Data
1. For the set of unparsed relations (if the set is empty, return Data):
2. Take the next relation not yet visited in this pass from the set of unparsed relations; if there is none, goto Step 5
3. Relation type:
   - internal or external:
     3.1. Determine the computability of the location:
          - Not computable: goto Step 2
          - Computable: calculate -> POS
     3.2. If POS = None, then mark the relation as parsed and goto Step 2
     3.3. Determine the computability of the format instance:
          - Not computable: goto Step 2
          - Computable: calculate -> FORMAT
     3.4. Relation type:
          - internal:
            3.4.1. For (POS, FORMAT), call this algorithm recursively
            3.4.2. Add the parsing result to the structure Data
            3.4.3. Mark the relation as parsed
          - external:
            3.4.4. Add (POS, FORMAT) to the structure Data
            3.4.5. Mark the relation as parsed
   - value relation:
     3.5. Determine the computability of the value:
          - Not computable: goto Step 2
          - Computable:
            3.5.1. Calculate -> VALUE
            3.5.2. Add VALUE to the structure Data
            3.5.3. Mark the relation as parsed
4. If the set of unparsed relations contains relations not yet visited in this pass, then goto Step 2
5. If no relation was parsed during this pass, then return ERROR
6. Goto Step 1
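The pseudocode above amounts to a fixed-point iteration: relations are revisited in passes until all of them are parsed or a pass makes no progress. The following Python sketch illustrates that loop under several assumptions that are not taken from the paper. The names `Expr`, `const`, `UInt`, `Struct`, `Relation`, and `parse` are hypothetical. Computability is approximated by an explicit list of field names that an expression depends on. Locations are treated as offsets relative to the start of the enclosing instance. The only primitive format is a little-endian unsigned integer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Expr:
    """Hypothetical expression: computable once every field in deps is parsed."""
    deps: Tuple[str, ...]
    fn: Callable

    def computable(self, data: dict) -> bool:
        return all(name in data for name in self.deps)

    def calc(self, data: dict):
        return self.fn(*(data[name] for name in self.deps))


def const(value) -> Expr:
    """Expression with no dependencies, so it is always computable."""
    return Expr((), lambda: value)


@dataclass
class UInt:
    size: int  # assumed primitive: little-endian unsigned integer of `size` bytes


@dataclass
class Struct:
    relations: list  # composite format: a list of Relation objects


@dataclass
class Relation:
    name: str
    kind: str  # "internal", "external", or "value"
    location: Optional[Expr] = None
    fmt: Optional[Expr] = None
    value: Optional[Expr] = None


def parse(buf: bytes, base: int, fmt):
    if isinstance(fmt, UInt):  # primitive leaf: read the bytes directly
        return int.from_bytes(buf[base:base + fmt.size], "little")
    data, unparsed = {}, list(fmt.relations)
    while unparsed:  # Step 1: repeat passes until nothing is left
        progress = False
        for rel in list(unparsed):  # Steps 2 and 4: visit each unparsed relation
            if rel.kind == "value":
                if not rel.value.computable(data):  # Step 3.5
                    continue
                data[rel.name] = rel.value.calc(data)
            else:
                if not rel.location.computable(data):  # Step 3.1
                    continue
                pos = rel.location.calc(data)
                if pos is not None:  # Step 3.2: None means the field is absent
                    if not rel.fmt.computable(data):  # Step 3.3
                        continue
                    f = rel.fmt.calc(data)
                    if rel.kind == "internal":  # Steps 3.4.1 to 3.4.3
                        data[rel.name] = parse(buf, base + pos, f)
                    else:  # Steps 3.4.4 and 3.4.5: record, but do not parse
                        data[rel.name] = (base + pos, f)
            unparsed.remove(rel)  # mark the relation as parsed
            progress = True
        if not progress:  # Step 5: a full pass parsed nothing
            raise ValueError("uncomputable relations: "
                             + ", ".join(r.name for r in unparsed))
    return data


# Usage: dependent relations are listed first, so two passes are needed.
u8 = const(UInt(1))
header = Struct([
    Relation("first", "internal", Expr(("off",), lambda off: off), const(UInt(2))),
    Relation("total", "value", value=Expr(("count",), lambda c: c * 2)),
    Relation("blob", "external", Expr(("off",), lambda off: off + 2), u8),
    Relation("off", "internal", const(1), u8),
    Relation("count", "internal", const(0), u8),
])
buf = bytes([3, 4, 0xAA, 0xBB, 0x34, 0x12, 0xFF])
result = parse(buf, 0, header)
```

In this example the first pass can only parse `off` and `count`, because the other three relations depend on them. The second pass resolves `first`, `total`, and `blob`. The external relation `blob` is stored as a (position, format) pair and is not parsed, mirroring Steps 3.4.4 and 3.4.5.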
Cite this article
Evgin, A.A., Solovev, M.A. & Padaryan, V.A. A Model and Declarative Language for Specifying Binary Data Formats. Program Comput Soft 48, 469–483 (2022). https://doi.org/10.1134/S0361768822070040
DOI: https://doi.org/10.1134/S0361768822070040