February Meeting Notes: cryptography and simple histograms

For this month’s meeting, Joel and I talked about simple cryptography theory, and followed it up by discussing how a histogram can be used to help analyze and break certain types of cryptography schemes.

Histograms are just a way to graphically represent data. This can be color data from an image, or data in a text or binary file. Really, histograms are just simple bar graphs.

Read on for the rest of the details.

Without posting everything that Joel and I said in the meeting, it’d be pretty difficult for me to convey exactly how histograms can be used for cryptanalysis. Simply put, the histogram shows the number of times each character appears in a file. In a simple letter-substitution scheme, it is easy to see which letters show up most often in natural language and which show up most often in the encrypted text. There’s a fairly good chance that you can start replacing encrypted letters with the natural-language letters that have a similar frequency. Once you’ve accurately substituted enough letters in the encrypted text to form a few whole words or easily-guessed partial words, it becomes no more difficult to completely decrypt the message than playing a game of hangman that’s already half-solved.
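To make the idea concrete, here is a minimal Python sketch of that first guessing step. It is not the code from the meeting (that was a PHP hack); the function names and the `ENGLISH_BY_FREQUENCY` ordering are my own assumptions, and published frequency tables differ a little past the first few letters.

```python
from collections import Counter
import string

# Approximate ordering of English letters from most to least frequent.
# This is an assumption: published tables disagree slightly on the order.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def letter_histogram(text):
    """Count how often each letter a-z appears, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def guess_substitution(ciphertext):
    """Pair cipher letters with English letters by frequency rank.

    This is only a first guess. Letters with similar counts are often
    paired wrongly and need to be corrected by hand.
    """
    ranked = [letter for letter, _ in letter_histogram(ciphertext).most_common()]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))


def apply_guess(ciphertext, mapping):
    """Substitute the guessed letters; anything unmapped is left as-is."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext.lower())
```

On a short message the guess will be mostly wrong. It only starts to look like the half-solved hangman game once the encrypted text is long enough for the counts to settle down.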

Here are a few histograms I generated for large text files. They are useful for analyzing how frequently certain characters appear in a file:

As you can see, the charts both top out at the same place. That’s a space character. Spaces are easily the most common character found in written text. All the bars to the left of the tall bar are “control characters” such as carriage returns. Directly to the right of the tall bar are symbols, numbers, upper-case and lower-case letters respectively.

Notice that the whole right side of the above graphs is empty. Those positions are the so-called “high ascii” characters, which aren’t commonly found in written text but are common in binary files.
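The regions described above correspond to fixed ranges of byte values, so they can be totalled straight from the counts. This is a rough Python sketch rather than the code I showed at the meeting; `byte_histogram` and `summarize_regions` are names made up for the example.

```python
def byte_histogram(data):
    """Return a list of 256 counts, one per possible byte value."""
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    return counts


def summarize_regions(counts):
    """Total the counts for each region of the chart, left to right."""
    return {
        "control characters (0-31)": sum(counts[0:32]),
        "space (32)": counts[32],
        "symbols, digits, letters (33-126)": sum(counts[33:127]),
        "delete (127)": counts[127],
        "high ascii (128-255)": sum(counts[128:256]),
    }


if __name__ == "__main__":
    # A tiny stand-in for a text file; real input would come from
    # open(path, "rb").read() on whatever file you want to chart.
    sample = b"The quick brown fox\r\njumps over the lazy dog\r\n"
    for region, total in summarize_regions(byte_histogram(sample)).items():
        print(f"{region}: {total}")
```

For plain English text, the "high ascii" total should come out at or near zero, which is the empty right side of the charts.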

This is a histogram of a file containing only random data:

And finally, a histogram of an OpenBSD binary executable file, which has a lot of nulls (on the far left) throwing off the curve. Nulls are very common in executable files on almost any platform.
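One quick way to see the difference between those last two charts without drawing them is to ask how much of the file the single most common byte takes up. The sketch below is only an illustration: `top_byte_share` is a hypothetical helper, and `fake_binary` is a made-up stand-in for an executable, not a real OpenBSD binary.

```python
import os
from collections import Counter


def top_byte_share(data):
    """Return (byte value, fraction of the data) for the most common byte.

    Assumes data is non-empty.
    """
    value, count = Counter(data).most_common(1)[0]
    return value, count / len(data)


# Random data: every byte value should appear about equally often,
# so no single bar stands out.
random_data = os.urandom(65536)

# Stand-in for an executable: some arbitrary bytes followed by a long
# run of null padding. Real binaries are less extreme than this.
fake_binary = os.urandom(4096) + bytes(60000)

if __name__ == "__main__":
    print("random data:", top_byte_share(random_data))
    print("null-heavy data:", top_byte_share(fake_binary))
```

For the random data the share should sit close to 1/256, while the null-heavy data reports byte 0 with a share close to 1. That one tall bar on the far left is what throws off the curve.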

Finally, you can take a look at my code. It was a pretty quick hack for personal research reasons, but I decided to bring it up in the meeting today. I made sure to document most of the important logic in the code.


Thanks to everyone who showed up for this month’s meeting!

170 thoughts on “February Meeting Notes: cryptography and simple histograms”

  1. Re: February Meeting Notes: cryptography and simple histo…
    There are formulas for breaking the code once you know the frequency distributions. For example, in English the letter ‘e’ is the most common, so you look at your graph and see: oh, the most frequent character probably represents the letter ‘e’. Individual letter frequencies for most major languages are very well documented and available. Frequency counts like these were part of how codes were broken during WWII. It is really meant for primitive substitution or permutation ciphers. You can’t really use it for sophisticated stuff like PGP, DES, etc.
    • Re: February Meeting Notes: cryptography and simple histo…
      Yes, this sort of cryptanalysis is limited to certain types of simple encryption, usually of the symmetric key variety. Asymmetric encryption requires a much more complicated form of analysis.

      That said, symmetric key ciphers are not always weak. Their strength, however, comes more from protocol than from technology. For instance, as I discussed in the PHP meeting, the result of a one-time-pad style binary XOR using highly random data as the pad still looks like highly random data, and there’s not a darn thing anyone will be able to do to break it without gaining access to the key. If proper protocol is followed, gaining that access would be nearly impossible. The best one could hope for would be to compromise the pad without either party finding out, and so be able to intercept a certain amount of future communications.

      Regardless, the histograms were a small part of a larger research project that Joel and I were working on with some mutual friends. Since it was written in PHP and it was a very simple project, I used it to spark the discussion for the meeting, and it was a good, fun time.
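
The one-time-pad style XOR mentioned in the reply above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the PHP from either meeting: `xor_bytes` is a made-up helper, and the pad here comes from Python's `secrets` module.

```python
import secrets


def xor_bytes(data, pad):
    """XOR each data byte with the pad byte at the same position."""
    if len(pad) < len(data):
        raise ValueError("the pad must be at least as long as the message")
    return bytes(d ^ p for d, p in zip(data, pad))


message = b"meet at the usual place"

# The pad must be random, kept secret, and never reused.
pad = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, pad)    # looks like random bytes
recovered = xor_bytes(ciphertext, pad)  # XOR with the same pad undoes it

if __name__ == "__main__":
    print(ciphertext.hex())
    print(recovered.decode("ascii"))
```

Because every ciphertext byte depends on a fresh random pad byte, a histogram of the ciphertext should look as flat as one of random data. All of the security rests on how the pad is generated, shared, and discarded, which is the protocol part.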
