-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Which of the multibase formats preserves lexicographic order? #124
Comments
I don't believe we have such a base encoding in multibase, see https://github.com/multiformats/multibase for the current list. I can imagine a use for this (and can imagine how Dean is using it too!) so it might be worth getting an entry in to the multibase table (the linked repo, see multibase.csv). It might take a bit of discussion with others over there as we haven't added a new one in a while and it doesn't get that much attention. It would also be good if you had some working code that worked as a js-multiformats multibase codec (maybe in this repo, although it could also just be an external package that we link in the README here like we do with codecs and hashers). Unfortunately Dean's code is using |
This article |
Sorry I meant this article: https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding |
Looking at bit deeper, maybe we're already covered? Have a look at the dictionaries used for the bases in this repo and see which are properly ordered. https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js suggests that |
Right, so to expand on the above point, I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented. |
tbh, I don't even know if that same problem persists for base32 variants; you're going to have to do some experimentation I think |
Can you expand on what you mean by this?
I imagine hex encoding should preserve lexicographic order. But I guess this also depends on what is lexicographic order? My understanding is that the order of a string should be based on the comparison on each individual byte in the string. Since bytes are just 8 bit numbers, then if you have In JS, it says string sort is based on utf-16 code units. I'm not sure if that is equivalent to my understanding. |
Yeah, actually I'm not at all sure how to make this work properly, with all of the bases (that are not base-256) you're dividing up bytes into pieces and having to span across bytes as you slice. Even for base64 you're slicing the bytes into 6-bit pieces, so you get the first 6 bits of a byte, then the remaining 2 and 4 of the next, then the remaining 4 of the next and 2 at the start of the next, then the remaining 6, etc. etc. So how do you retain proper sorting when you're splitting up bytes like that? Maybe I'm not fully grokking what base64-lex is doing but I find it difficult to imagine. Base-16/hex is splitting bytes in half and presenting 2 characters for each byte, I guess that might retain sorting if you did it right. Then it all comes down to endianness which is just a mess! |
I did tests on 100,000 Ids generated by base58btc, they were all correctly sorted. Whereas less than 100 ids generated with base64 and I got a sort order error. So I'm guessing base58btc is suffiicent. Did my tests here: MatrixAI/js-id#8 |
Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar. |
Does it work in the same principle that hex encoding works. For hex,
every byte can be presented by 2 hex digits. So `0 - F` is half of a
byte, then `0 - F` is the next half.
Hex encoding's bit-order also make sense. So for `FF`, the left most hex
is the most significant 4 bits, and the right hex is the least
significant 4 bits. Thus the ordering is correct for `01` all the way to
`FF`.
Hex numbering is also 0-1 then A-F, and that is also correct in terms of
lexicographic order.
For a string of multiple bytes, each byte gets encoded into 2 hex digits
(if padded), or just 1 or 2 hex digits. These should be equivalent from
a sorting perspective. Anyway for `[123, 124]` that becomes `7b7c` and
`[124, 125]` is `7c7d`. Imagine we wanted to compare each other for
sorting. Assuming lexicographic sort is based on each character (which
in this case it is not precisely the case, because JS uses utf-16), then
we are comparing each character by character.
```
7b7c
7c7d
```
The `7b7c` would be considered to come before `7c7d`, simply at index 1:
`b < c`.
In base58btc the exact relationship between bytes to digits is not as
exact as hex.
…On 22/10/2021 1:20 pm, Rod Vagg wrote:
Well that's fascinating! What can we say about the limitations of
these tests? Am I correct that these IDs you're testing are quite
short, is this likely to scale to arbitrary lengths? This might be
interesting for others who come looking for something similar.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#124 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAE4OHM4PJ44Q2KP7OJW4ITUIDC6HANCNFSM5GNCQ6OA>.
|
I realise that I don't understand base58btc internals, I can do a test with 100,000 random strings, and sort them, then convert them to base58btc and then sort it again and compare for a sanity check. |
Ok my tests show that import type { Codec } from 'multiformats/bases/base';
import crypto from 'crypto';
import { bases } from 'multiformats/basics';
function randomBytes(size: number): Uint8Array {
return crypto.randomBytes(size);
}
type MultibaseFormats = keyof typeof bases;
const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
const codec = bases[k];
basesByPrefix[codec.prefix] = codec;
}
function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
const codec = bases[format];
return codec.encode(id);
}
function fromMultibase(idString: string): Uint8Array | undefined {
const prefix = idString[0];
const codec = basesByPrefix[prefix];
if (codec == null) {
return;
}
const buffer = codec.decode(idString);
return buffer;
}
const originalList: Array<Uint8Array> = [];
const total = 100000;
let count = total;
while (count) {
originalList.push(randomBytes(36));
count--;
}
originalList.sort(Buffer.compare);
const encodedList = originalList.map(
(bs) => toMultibase(bs, 'base58btc')
);
const encodedList_ = encodedList.slice();
encodedList_.sort();
// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order
for (let i = 0; i < total; i++) {
if (encodedList[i] !== encodedList_[i]) {
console.log('Does not match on:', i);
console.log('original order', encodedList[i]);
console.log('encoded order', encodedList_[i]);
break;
}
}
const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
// @ts-ignore
if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
console.log('bug in the code');
break;
}
} Changing it to |
Looking at the errors and trialing them out:
According to rfc4648, the So there you have it, the conclusion. Pretty much |
JS's binary string form also preserves sort order #122. |
I suggest that we should add in a base64 variant that preserves order since base32 is quite a low base. And probably change over identity encoding to binary string form mentioned in #122. |
I've come across a situation requiring the ability to base-encode but preserve the lexicographic-order of the input bytes.
That is the order of the base encoding alphabet should be in the same order as those bytes that are mapped to the base encoding.
An example of this library is https://github.com/deanlandolt/base64-lex which is a base64 that preserves lexicographic order.
Which base encodings in multibase support preservation of lexicographic order?
The text was updated successfully, but these errors were encountered: