Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add properties to List for byte-based encodings/hashes #924

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

HT154
Copy link
Contributor

@HT154 HT154 commented Feb 2, 2025

The primary motivation here is to allow Pkl to properly work with arbitrary binary data. String can be abused to some extent, but because text encoding (and unicode normalization) is involved, what you see isn't what you get:

❯ pkl eval pkl:base -x 'let (_ = trace(0xff.toRadixString(16))) trace(0xff.toChar())' | hexdump                 
pkl: TRACE: 0xff.toRadixString(16) = "ff" (repl:text)
pkl: TRACE: 0xff.toChar() = "ÿ" (repl:text)
0000000 bfc3                                   
0000002

I'm most focused on base64 here, but I don't see an reason to omit the hashes either.

This can actually be implemented in-langage, but it's far from performant:

const local base64Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".chars
const function base64Bytes(bytes: List<UInt8>|Listing<UInt8>): String =
  let (padding = if (bytes.length % 3 == 0) "" else "=".repeat(3 - (bytes.length % 3)))
    bytes.fold(List(List()), (acc, elem) ->
      if (acc.last.length == 3) acc + List(List(elem))
      else acc.dropLast(1) + List(acc.last + List(elem))
    )
      .map((tuple) ->
      let (n = tuple[0].shl(16) + (tuple.getOrNull(1)?.shl(8) ?? 0) + (tuple.getOrNull(2) ?? 0))
        "\(base64Chars[n.ushr(18).and(63)])\(base64Chars[n.ushr(12).and(63)])\(base64Chars[n.ushr(6).and(63)])\(base64Chars[n.and(63)])"
    )
      .join("") + padding

Here's a test encoding a 16kB buffer:

❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null  19.55s user 0.36s system 125% cpu 15.862 total

And here's the exact same evaluation switched to use the List.base64 property introduced in this PR:

❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null  4.22s user 0.19s system 325% cpu 1.356 total

This is roughly a 12x speedup!

Methods on generic types that only work for some type arguments don't really exist in Pkl yet. I considered an entirely separate stdlib class (possibly even a List subclass?) that forces the element type to UInt8 I'm interested in hearing thoughts on how to best approach this, but this was the most expedient path to prove the concept.

As for why I'm doing this at all, here's a teaser:
Screenshot 2025-02-01 at 16 58 38

@bioball
Copy link
Contributor

bioball commented Feb 3, 2025

Another solution, which might be cleaner, is to have a pkl:bytes standard library module, which has:

external function base64(bytes: List<UInt8>)

@bioball
Copy link
Contributor

bioball commented Feb 3, 2025

Overall, I think we should think about how Pkl should be representing binary data first. Having a .base64 etc for binary data would naturally be part of the binary data API.

Some thoughts about using List<UInt8> for binary data:

Good parts:

  • We can probably optimize this by representing List<UInt8> as byte[] within our truffle implementation.
  • No new type introduced.

Concerns:

  • Ideally we'd want to export this as the corresponding byte array type in codegen (byte[] in Java, []byte in Go, ByteArray in Kotlin, [UInt8] in Swift). But there's nothing special about List<UInt8>
  • Would also need to special-case messagepack serialization? (should represent this using the binary message format, instead of the array message format).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants