Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Jakob's Blog coding for fun and growth

Rust in a coding competition (Part 3) - Filling in the gaps

Python, Haskell, C, or Rust? Which do you like best to solve this task?

This is the third post in a series. Follow the link to start reading part 1 first, where you can also find the problem statement and the first part of the solution.

Second subtask: Guessing

Okay, we are still trying to decrypt an encrypted message. From the task description, we know the encryption algorithm but the key will be different in different test cases. In the previous two posts, I have presented how to decode a message, given the key. Next, we want to generate all possible keys (simply all ASCII characters), decrypt the message with all of them and finally decide which of the received messages is most likely to be what has initially been sent. Let us start with a function that evaluates the credibility of a given string to be an English sentence.

The strategy

First up, we do not have access to a dictionary because we have to upload the code to the server to be evaluated and the memory and execution time is restricted. So we have to come up with some other idea. To decide on a strategy, let us have a look at some of the generated strings by our decoder with wrong keys:

Ex+,x~eogu,xc,~cog,m,~duai,xc,~cog,m,~duai,xdmx+,~ekdx,cb,xeai

Jw$p#wqj`hz#wl#ql`h#b#qkznf#wl#ql`h#b#qkznf#wkbw$p#qjdkw#lm#wjnf

Kv%q”vpkai{“vm”pmai”c”pj{og”vm”pmai”c”pj{og”vjcv%q”pkejv”ml”vkog

Hu&r!ushbjx!un!snbj!`!sixld!un!snbj!`!sixld!ui`u&r!shfiu!no!uhld

Ns t’sundl~’sh’uhdl’f’uo~jb’sh’uhdl’f’uo~jb’sofs t’un`os’hi’snjb

Or!u&rtoem&ri&tiem&g&tnkc&ri&tiem&g&tnkc&rngr!u&toanr&ih&rokc

While the correct message is:

It’s tricky to rock a rhyme to rock a rhyme that’s right on time

As we can see, most messages are full of special characters, all placed randomly after each other. The message we are interested in, on the other hand, is nicely grouped into words and separated by spaces. So let us see if we can exploit that.

Evaluating each candidate string

Most simply, we can just count the number of words we get if we split up the message at each whitespace characters like this:

msg.split(char::is_whitespace).count();

The message with the highest count is then:

7 Y

^ ^ ^^ ^ ^ ^^ ^  Y ^  ^^ 

Okay, that is obviously not what we want. We do not know whether line breaks will be contained in some of the hidden test cases, however, we can be quite sure that the words generally consist of alphabetic characters. So let us count only count those words made of letters only:

msg.split(char::is_whitespace)
        .filter(|word| {word.chars().all(char::is_alphabetic)})
        .count()

And sure enough, here is the top scoring message guessed correctly!

It’s tricky to rock a rhyme to rock a rhyme that’s right on time

However, looking at one of the other sample test cases we find new problems. We get the following top scores:

Score = 8
dEAR mR iM tOO gOOD TO cALL OR wRITE mY fANS

Score = 6
eD@SlShL uNN fNNE UN b@MM NS vSHUD lX g@OR

Score = 6
cBFUjU n J sHH `HHC SH dFKK HU pUNSB j^ aFIT

And the correct output Dear Mr. I’m-Too-Good-to-Call-or-Write-My-Fans only shows up at rank 14 with a score of 1.

Score = 1
Dear Mr. I’m-Too-Good-to-Call-or-Write-My-Fans

Apparently splitting by whitespace only is not enough for this task. To pass all sample tests as well as all hidden test cases, I had to add two more tweaks.

  1. I also split words on dashes.
  2. If a word starts with an upper-case and ends with a lower-case letter, I give it an extra score boost.

This is the final code I used in the competition:

fn evaluate_credibility(msg: &String) -> u64 {
    let mut score = 0;
    let words = msg.split(|c: char|{ c.is_whitespace() || c == '-'});
    for word in words {
        if word.chars().all(char::is_alphanumeric) {
            score += 5;
            if word.starts_with(char::is_uppercase) 
                && word.ends_with(char::is_lowercase)
            {
                score += 20;
            }
        }
    }
    score
}

Note how easy to understand this code is! Considering that Rust is a highly performance-optimized systems programming languages, it strikes me how we can write such simple looking code and how the complexity is hidden so well. You do not have to understand what closures or iterators are but you will still perfectly understand what this code does.

This gives us a clear winner for all test cases. Taking the example we have failed on before, here are the top three candidates:

Score = 185
Dear Mr. I’m-Too-Good-to-Call-or-Write-My-Fans

Score = 80
Ihl-@#-D*` Ybb Jbbi yb Nlaa b Zdyh @t Klc~

Score = 30
eD@S╔lS╔h═L uNN fNNE UN b@MM NS vSHUD lX g@OR fANSE 40 dEAR mR iM

I would say, that is a fair result!

Finalizing the solution

All right, we have all the tricky parts prepared, now we just need to read from the standard input and output the result. Here is one way of doing this with Rust:

fn main() {
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let input= line.unwrap();
        let best_guess = 
         (0..128).map(|key| { decode(&input, key) } )
          .max_by_key( |msg| { evaluate_credibility(msg) } );
    println!("{}", best_guess.unwrap());
    }
}

Next, As this series is mainly about comparing different solutions in different languages, I will present implementations in other languages, doing the equivalent of above code.

Each implementation will rely on the decode function presented in part 1 of the series. The full source code is also available on my GitHub.

Using Haskell

To split a string into words separated by whitespace, there is a cool function named words in the Prelude. To also consider dashes, there is the splitOn function which takes an argument, what should be the splitting character. So the idea is to apply one after the other and summarize the result in one final list of words like that:

import Data.Text (splitOn)
splitWords :: String -> [String]
splitWords msg = concatMap words (splitOn "-" msg)

Unfortunately, this code does not compile. We need some extra type conversion because splitOn operates on Text rather than a simple String. Here is how to fix it:

import Data.Text (splitOn, pack, unpack)
splitWords :: String -> [String]
splitWords msg = concatMap words (map unpack (splitOn (pack "-") (pack msg)))

Continuing with the scoring of the resulting list of words, here is an implementation relying on a foldl and an auxiliary function sumScore which accumulates the scores of each word.

import Data.Char (isAlphaNum, isLower, isUpper) 
evalCred  :: String -> Integer
evalCred s = foldl sumScore 0 (splitWords s)
    where sumScore acc word = acc + 
            if all isAlphaNum word 
                then if isUpper (head word) && isLower (last word)
                    then 25
                    else 5
                else 0

I suppose if-then-else clauses do not feel all that typical for Haskell. Still, it is relatively concise in what it is doing and in this case, it seems to be a nice fit.

Going on to find the maximum of all scores, there is a pretty short way, as we would expect in Haskell.

import Data.Ord (comparing)
import Data.List (maximumBy)
solve :: String -> String
solve s = maximumBy (comparing evalCred) (map (\key -> decode key s) [0..127])

And finally, to read from and write to the terminal we have to do something like this:

import Test.QuickCheck.Text(oneLine)
main :: IO ()
main = interact $ show . solve . oneLine

That is all. I think the whole Haskell implementation is simple enough to read and quickly understand what it is doing. But admittedly, as I am usually not programming in Haskell, getting all the details with parenthesis and types right took me some time and effort.

Using Python

To split words in Python, there is the function split that does almost exactly what we need. Called without an argument, it will simply split by whitespace. Called with an argument, it will split by the string given as argument. However, to split by both, we have to make two consecutive calls, similar to the Haskell example.

Using the list comprehension typical for Pyhton, my evalultion function ends up looking like this:

def evaluate_credibility(str):
  tokens = str.split()
  words = [ word for token in tokens for word in token.split("-") ]
  return sum(score_words(words))

The list comprehension looks a bit messy to me as a Python beginner. But I suppose for programmers who are used to work with lists in this manner, this is very straightforward and easy to understand.

For the scoring of the words, a generator that takes a list of words and yields scores for them is the perfect abstraction in my opinion.

def score_words(words):
   for w in words:
      if w.isalnum(): 
        if len(w) > 1 and w[0].isupper() and w[-1].islower():
          yield 25
        else:
          yield 5

At this point, generating all messages and choosing the one with the highest score is just a two liner.

def solve(input):
  messages = [decode(input, key) for key in range(128)]
  return max(messages, key=evaluate_credibility)

And we are already almost done with the Python example. Just a few lines to read the input and print out the result.

data = sys.stdin.read().splitlines()
for line in data:
    print (solve(line))

Using C

Okay, take a deep breath before you continue reading. Although we have the all-powerful stdlib available in C, the API is older and pursuing other goals than the API found in Python, Haskell, or Rust. Therefore, the implementation in C is the longest and most complex one presented here. So, just be aware of that and excuse me if I am not able to fully cover every weird little detail of my implementation you are about to see.

Starting with the easy part, here is our main function:

int main() {
    char* input = calloc(1, MAX_LENGTH+1);
    gets(input);
    solve(input);
    return 0;
}

Well, this interaction with the terminal is really as easy as it gets from an API perspective. But it is also deprecated and generally a bad idea to use it in a productive environment. For a competition like Codecon on the other hand, it is perfectly viable.

Note that memory is allocated with calloc rather than malloc. The reason is that gets will not NULL terminate the string it reads in and therefore we have to make sure the memory used to store the input is initially zeroed out. Also, I omitted to free the memory at this point, simply because it would be at the last line of our program, at which point all memory is being freed anyway.

Now let us move on to the meaty part. Here is the code for scoring a candidate message:

long evaluateCredibility(char* msg) {
  int len = strlen(msg);
  if(len <= 0){ return -1; }
  char* msgCopy = malloc(2 * len);
  strcpy(msgCopy, msg);
  long score = 0;
  char splitOnChars[] = " \t\n\v\f\r-";
  char* word = strtok (msgCopy, splitOnChars);
  do {
    score += scoreWord(word);
  } while(word = strtok (NULL, splitOnChars));
  free(msgCopy);
  return score;
}

The function strtok is used to split up the string into words, splitting at the characters in the parameter string. There is no shortcut to split on all whitespace characters in C, so the parameter string explicitly contains all whitespace characters after each other and finally the dash.

The tricky part of strtok is that it will only chop off the first word of the complete string. Thus, we have to call it in a loop as shown above. Furthermore, strtok actually will modify the given input string. Therefore it is better to make a copy of the string first, or it will be destroyed by the time we want to write it to the terminal.

Talking about the loop condition, the way expressions and types work in C, I can use assignments like word = strtok (NULL, splitOnChars) as the loop condition, which personally I really like to do. I understand it can look confusing to people who have not seen it before and some people will have a strong case against such loop conditions. But I do not want to go into that discussion right now, and luckily, since this is a monologue, I do not have to.

To get the score for each word on its own, this function is used:

long scoreWord(char* word){
    int allAlphanumeric = -1;
    long score = 0;
    char* p = word;
    while(*p){
      allAlphanumeric &= isalnum(*(p++));
    }
    if(allAlphanumeric) {
      score += 5;
      if(isupper(*word) && islower(*(word + strlen(word) - 1))){
        score += 20;
      }
    }
    return score;
}

This loop conditions works, because the end of the string is by definition a '\0' character. The pointer p goes through each character and the loop will end when that character is binary zero.

What is missing here, is the part that decodes with all 128 keys and evaluates the result. Of course, that means we are going to see another loop and some messy memory management.

void solve(char* input){
  long maxScore = -1;
  char* bestGuess = NULL;
  int i;
  for(i = 0; i < 128; i++) {
    char* output = decode(input, i);
    long score = evaluateCredibility(output);
    if(score > maxScore) {
      maxScore = score;
      if(bestGuess != NULL){
        free(bestGuess);
      }
      bestGuess = output;
    }
    else {
      free(output);
    }
  }
  printf("%s", bestGuess);
  free(bestGuess);
}

I would not say it is remarkably complicated to write or understand such code. But I am sure it is very easy to make mistakes and spread bugs everywhere, which is a big concern in my opinion.

Surely an experienced C programmer would find a cleaner way to handle the memory management but comparing it to the three other languages, there we simply did not have to care about memory management at all! While Haskell and Python give up some performance and require a garbage collector at run-time, Rust determines automatically at compile-time where the calls to malloc and free should be placed inside the final binary.

And just in case you do not believe that my C code actually works without crashing, here is how it looks in my Windows Bash terminal:

 anonymous@PC:/some/path$ gcc xor_cypher.c -o xor.o
 xor_cypher.c: In function ‘main’:
 xor_cypher.c:74:5: warning: ‘gets’ is deprecated (declared at /usr/include/stdio.h:638) [-Wdeprecated-declarations]
      gets(input);
      ^
 /tmp/ccGLhuFc.o: In Funktion `main':
 xor_cypher.c:(.text+0x2ee): Warnung: the `gets' function is dangerous and should not be used.
 
 anonymous@PC:/mnt/c/dev/community/blog/CodeCon$ ./xor.o
 023f6c386b3f39222820326b3f246b392428206b2a6b392332262e6b3f246b392428206b2a6b392332262e6b3f232a3f6c386b39222c233f6b24256b3
 It's tricky to rock a rhyme to rock a rhyme that's right on timeanonymous@PC:/some/path$

At the end of the day

To sum up my opinion about the four presented languages, regarding their use in a coding competition:

  • C is definitely not the best fit for such a task. But considering it is but just one step away from assembly, I think we should appreciate that we can still solve such a task relatively easy.
  • Haskell is cool for people who really like functional programming. Assuming you are good at that, Haskell produces very readable and sometimes impressively simple code for complex problems.
  • Python is the easiest language to get a working solution quickly without bothering about the type system too much. The code is also quite readable and easy to understand, even if you are not an expert at the language.
  • Rust is most flexible and lets the programmer do exactly what he or she wants in a concise way and the code is generally very readable. To write code quickly, however, some good experience using the language is necessary, or the program will potentially not even compile a single time before the time is up.

As a side note, here I have skipped the performance discussion completely. I am looking forward to dedicating a full blog post about a performance comparison of different languages. There, I will probably also present how the code samples from this series perform compared to each other.

Technology Stack

Rust

The Rust programming languages had its first stable release in May 2015. It has been designed with performance and low-level programming as a high priority. Therefore, it is well suited for applications that would otherwise be done in C++. The huge improvements of Rust over C++ are its strong safety guarantees like data-race freedom and memory safety at compile time.

Haskell

A purely functional programming language with a strong, static type system. Especially beloved in academia but also relevant to the industry.

Python

An interpreted language with a primary focus on code readability. It is strongly, dynamically typed and is often used in modern science to perform computations or even to train machine-learning models.

C

Arguably the most important programming language in human history. If you have not heard about it, go and read it up on Wikipedia by clicking on the title.