Bacoor Logo.png

Data structure in Ethereum. Episode 1: Recursive Length Prefix (RLP) Encoding/Decoding.

 

Image source: 

https://www.behance.net/gallery/36191437/Mini-Machines-02 by pixego

 

There are literally a lot of papers and blogs which explain how Ethereum organizes its data, but they all seem so disconnected and really hard for you to get an overall picture from them.

But this series can help you do that, by explaining the data structure in Ethereum through 3 problems below:

  1. Recursive Length Prefix (RLP) Encoding/Decoding.

  2. Trie — Radix, Merkel, Patricia.

  3. State Tree Pruning.

 

Recursive Length Prefix (RLP)

In computer science, data serialization is necessary for many complex data forms to be stored or transmitted in an only one formal format. Because of that, RLP is an encoding/decoding algorithms help Ethereum to serialize data and possible to reconstruct them quickly.

 

RLP Encoding

As Ethereum mentioned, the RLP encoding function takes in an item. An item is defined as follows:

  • A string (will be converted to byte array) is an item

  • A list of items is an item

For example, all of objects below are items:

  • “dog”

  • []

  • [“dog”]

  • [[], “dog”, [“cat”], “ ”]

RLD encoding is defined as follow:

  1. If input is a single byte in the [0x00, 0x7f]range, so itself is RLP encoding.

  2. If input is non-value (uint(0), []byte{}, string(“”), empty pointer …), RLP encoding is 0x80. Just please notice that 0x00value byte is not non-value.

  3. If input is a string with 2–55 byte long, RLP encoding consists of a single byte with value 0x80plus the length of the string in byte and then array of hex value of string. It’s easy to see that the first byte is in [0x81, 0xb7]
    For example: “hello world” = [0x8b, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64]

  4. If input is a string with more than 55 byte long, RLP encoding consists of 3 parts from the left to the right. The first part is a single byte with value 0xb7plus the length in bytes of the second part. The second part is hex value of the length of the string. The last one is the string in byte. The range of the first byte is [0xb8, 0xbf].
    For example: a string with 1024 “a” characters, so the encoding is “aaa…” = [0xb9, 0x04, 0x00, 0x61, 0x61, …]. As we can see, from the forth element of array 0x61 to the end is string in byte and this is the third part. The second part is 0x04, 0x00 and it is the length of the string 0x0400 = 1024. The first part is 0xb9 = 0xb7 + 0x02 with 0x02 being the length of the second part.

  5. If input is an empty array, RLP encoding is a single byte 0xc0.

  6. If input is a list with total payload in 0–55 bytes long, RLP encoding consists of a single byte with value 0xc0plus the length of the list and then the concatenation of RLP encodings of the items in list. The range of the first byte is [0xc1, 0xf7].
    For example: [“hello”, “world”] = [0xcc, 0x85, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x85, 0x77, 0x6f, 0x72, 0x6c, 0x64]. In this RLP encoding, [0x85, 0x68, 0x65, 0x6c, 0x6c, 0x6f] is RLP encoding of “hello”, [0x85, 0x77, 0x6f, 0x72, 0x6c, 0x64] is RLP encoding of “world” and 0xcc = 0xc0 + 0x0c with 0x0c = 0x06 + 0x06 being the length of total payload.

  7. If input is a list with total payload more than 55 bytes long, RLP encoding includes 3 parts. The first one is a single byte with value 0xf7plus the length in bytes of the second part. The second part is the length of total payload. The last part is the concatenation of RLP encodings of the items in list. The range of the first byte is [0xf8, 0xff] .

  8. There’s one more thing, though not mentioned in wiki Ethereum but in Golang source code: with boolean type, true = 0x01and false = 0x80 .

 

RLP Decoding

RLP decoding will become much easier when you figure out how RLP encoding works. Actually, RLP decoding just receives encoded input and decodes the type and the length of that data.

  1. According to the first byte of input, RLP decoding analyses data type, the length of the actual data and offset.

  2. According to the type and offset of data, decode data correspondingly.

  3. Continue to decode the rest of the input if still possible.

RLP decoding is full explained in wiki Ethereum. If you want to study more about it, please visit the links I put at the references below.

 

Diving into RLP

From wiki Ethereum, we can get that RLP just focuses on byte, string and list. Some extra types of data such as big number, boolean, pointer, slide… are based on which programming language we use to implement RLP.

As we knew, the first byte of encoded data decides which type of data.

  • [0x00, 0x7f]: byte

  • [0x80, 0xbf]: string

  • [0xc0, 0xff]: list

The first question, Why don’t we use a fixed prefix instead of a dynamic prefix?

 

First of all, you can see in RLP, sometime the data needs some bytes to describe the type and the length of data. But sometime the data also shows them by itself. The main reason is to save the memory space. If we try to use a fixed prefix, we have to add them in all of data we decode and in some situation, the main data is even shorter than the prefix.

 

You can say that by using dynamic prefix, it becomes simpler to read, but that is only applied to human. Computer cannot differentiate which is more complex. Running two source codes also has the same computational complexity.

Besides, even if it is fixed, we are still not able to be sure about how many bytes will be used. So that is unnecessary.

 

The second question, Why did they choose 0x7f, 0x80, 0xbf, 0xc0 as a checkpoint?

Just think about it sequentially. We don’t want to use any prefix with encoding a single byte because it may double (or more) the memory to store the encoded data. So we need to determine a range in which, the byte is encoded by itself.

 

 

It may be not accidental when 0x7f was chosen. The ASCII has used 7 bits to encode 128 single character and it correspond with 0x7f . I believe that is the reason why [0x00, 0x7f] was chosen. However, what is RLP encoding of byte with with 0x80 value.


The answer is we add a prefix, RLP_encode(0x80) = [0x81, 0x80] .

After that, about string and list, we have no choice but using prefix. It’s distinct when we divide the rest of range of prefix into 2 equal halves. [0x80, 0xbf] for string and [0xc0, 0xff] for list.

 

The third question, Why must we use a range to describe a type instead of one value of byte only? Isn’t one value of byte already enough?

It’s true that one value of byte is enough to represent the type of data. But in order to get the offset, one other thing we need to know is that how long the data is. And to do that, we must add more prefixes. So we got two problems now:

  • Firstly, using a byte just for storing 2 values: 0x80prefix for string and 0x81 for list, is such a waste. Since it’s capability is much more than that.

  • Secondly, by doing that, we are also trying to fix the prefix and as what I discussed on the first question, it’s a waste of memory.

We choose a range of byte to not only to encode type of data but also for the length of data.

The fourth question, What would we do when the length of data is out of range of prefix?

That’s is an understandable question. We add more dynamic prefixes after the first byte to represent the length of data. For example, with the [0x80, 0xbf]range of string type, according to our strategy, we divide this range into halves, one ([0x80, 0xb7]) for string with length in range and one ([0xb8, 0xbf]) for string with the length out of range. The same goes with list.

 

Conclusion

We hope this article gives you a piece of knowledge about the motivation and intuition of RLP and we can partly understand how RLP works in Ethereum.

 

References:

Document from wiki Ethereum

RLP on original Golang code

RLP on my Golang code with some debug line

 

Bacoor Vietnam

 

Share on Facebook
Share on Twitter
Please reload

Please reload

Copyright © 2019 Bacoor Inc. All rights reserved.