Support other ngram groups. #7

shaleh · 2024-05-30T16:05:45Z

Refactored ngrams.rs into a module directory to support other groups of ngrams. Added 'programming' as the first new group. It consists of the top 100 keywords among popular programming languages. To make this easier, I defined a trait called `NgramData` along with some matching enums and empty structs.
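For readers skimming the thread, here is a rough sketch of the shape described above: a trait, one empty struct per group, and an enum that selects between them. Only the names `NgramData` and `ProgrammingData` come from this PR. The method names (`words`, `take`), the `EnglishData` struct, the `NgramGroup` enum, and the shortened word lists are assumptions made for illustration and may not match the actual diff.

```rust
/// Hypothetical shape of the trait; the real method set in the PR may differ.
pub trait NgramData {
    /// Source words for this group (assumed method name).
    fn words(&self) -> &'static [&'static str];

    /// Default helper: the first `count` words as owned strings.
    fn take(&self, count: usize) -> Vec<String> {
        self.words()
            .iter()
            .take(count)
            .map(|w| String::from(*w))
            .collect()
    }
}

// Shortened placeholder lists; the PR ships about 100 programming keywords.
const PROGRAMMING_WORDS: &[&str] = &["while", "yield"];
const ENGLISH_WORDS: &[&str] = &["the", "and"];

/// Empty struct acting as a handle for the programming group.
pub struct ProgrammingData();

/// Assumed counterpart for the existing English data.
pub struct EnglishData();

impl NgramData for ProgrammingData {
    fn words(&self) -> &'static [&'static str] {
        PROGRAMMING_WORDS
    }
}

impl NgramData for EnglishData {
    fn words(&self) -> &'static [&'static str] {
        ENGLISH_WORDS
    }
}

/// Assumed enum used to pick a group, for example from a CLI flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NgramGroup {
    English,
    Programming,
}

impl NgramGroup {
    /// Map the selected group to its data source.
    pub fn data(self) -> Box<dyn NgramData> {
        match self {
            NgramGroup::English => Box::new(EnglishData()),
            NgramGroup::Programming => Box::new(ProgrammingData()),
        }
    }
}
```

With this layout, adding another group would mean one more struct, one `impl NgramData`, and one enum variant, without touching the callers.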

Yes, there are overlaps with the standard ngrams for English. I had two thoughts here.

1. I am thinking ahead to when people might want to use this in their native language instead of English. The programming ngrams might be helpful there, and establishing the trait makes adding other groups easier than forcing everyone to use a file.
2. I really wanted to include symbol pairs like `<=` or `->`. However, this does not work yet: the current keymap definitions do not have symbols because they do not know about shifts on qwerty and the like. On top of that, the on-screen keyboard does not show the numbers/symbols row, and things are not set up for it.

Together, that means this is something of an aspirational PR. Might be worth applying it now and then iterating on improvements until the final state is reached?

Also, this PR has functions to break words into N-sized ngrams. Assuming the logic is correct, that could also make it easier for people to use word files: they can provide a dictionary, ask for trigrams, and let the code parse it out.

shaleh · 2024-05-30T16:07:57Z

What inspired me to adopt this app is that I am learning Colemak on a split ortho keyboard with 52 keys. My symbols are on a layer using QMK, so having a way to practice them and evaluate various layouts would be nice. But if this is the wrong tool, that is totally ok.

shaleh · 2024-05-30T16:13:47Z

I used "group" instead of "language" because maybe there are other things than languages here -- like programming keywords. Very much an arbitrary decision and open to better clarity if there are suggestions.

shaleh · 2024-05-30T16:18:59Z

src/ngrams/programming.rs

+use itertools::Itertools;
+
+pub struct ProgrammingData();
+


I used the common data instead of making this Rust specific. People might use the tool and not know or care about the language. That means not all of the Rust keywords made the cut.

shaleh · 2024-05-30T16:20:24Z

src/ngrams/programming.rs

+    "this", "throw", "true", "try", "type", "typedef", "typeof", "union", "unsigned", "until",
+    "using", "var", "void", "volatile", "when", "where", "while", "with", "xor", "yield",
+];
+


This is the "magic" that breaks up the words into ngrams. In theory this could be moved to mod.rs and exposed. Then it could be used to parse dictionaries the user provides instead of requiring them to have ngram lists.
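To show what moving this to mod.rs might look like, here is a minimal standalone sketch of splitting a word into overlapping N-character slices. It is modeled on the `char_indices` / `len_utf8` fragment visible in this diff, but the function name `char_ngrams`, its signature, and the handling of `n == 0` are assumptions, not the exact code in the PR.

```rust
/// Yield every window of `n` consecutive characters in `source` as a
/// borrowed slice. Works on char boundaries, so multi-byte UTF-8 is safe.
/// Yields nothing when `n` is 0 or longer than the word.
pub fn char_ngrams<'a>(source: &'a str, n: usize) -> impl Iterator<Item = &'a str> + 'a {
    source.char_indices().flat_map(move |(from, _)| {
        // Index of the last character of the window, relative to `from`.
        let last = n.checked_sub(1)?;
        source[from..]
            .char_indices()
            .nth(last)
            // `to` is the byte offset of the last char; add its width to
            // get the exclusive end of the slice.
            .map(|(to, c)| &source[from..from + to + c.len_utf8()])
    })
}
```

A dictionary file could then be handled by calling this on each line, for example `char_ngrams(word, 3)` to get trigrams, without the user having to prepare ngram lists by hand.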

shaleh · 2024-05-30T16:21:22Z

src/ngrams/programming.rs

+            .map(|(to, c)| &source[from..from + to + c.len_utf8()])
+    })
+}
+


All of the methods below use the itertools unique method to ensure there are no duplicate entries.

Also, these use `map(String::from)` instead of `map(|s| s.to_string())`.
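As an illustration of the dedup-then-own pattern described above, here is a small sketch. The PR itself uses the `unique` adapter from itertools; this version swaps in a `HashSet` filter from the standard library so that it compiles without extra crates. The function name `unique_owned` is hypothetical.

```rust
use std::collections::HashSet;

/// Drop repeated ngrams while keeping first-seen order, then convert the
/// survivors to owned `String`s. This is a std-only stand-in for
/// `.unique().map(String::from)` with itertools.
pub fn unique_owned<'a>(grams: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    let mut seen = HashSet::new();
    grams
        .into_iter()
        // `insert` returns false when the value was already present.
        .filter(|gram| seen.insert(*gram))
        .map(String::from)
        .collect()
}
```

Passing `String::from` directly avoids a closure and makes the conversion explicit; the result is the same as calling `to_string()` on each slice.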

wintermute-cell · 2024-06-01T22:11:22Z

Looks interesting, I'll take a closer look at this soon. I'm pretty busy at the moment so it might be a while until I can find the time!

shaleh · 2024-06-02T00:47:52Z

All good. No rush.


Update src/ngrams/english.rs

56d0fab
