Support other ngram groups. #7
Refactored ngrams.rs to support other groups of ngrams. Added 'programming' as the first new group; it comprises the top 100 keywords among popular programming languages. To make this easier I defined a trait called `NgramData` and some matching enums and empty structs.
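As a rough illustration of that design, here is a minimal sketch of a trait with one empty struct per group and a matching enum. Only the names `NgramData` and `ProgrammingData` come from this PR; the `words` method, `EnglishData`, `NgramGroup`, and the tiny word lists are assumptions made up for the example, not the PR's actual definitions.

```rust
/// Hypothetical shape for the trait: the PR names `NgramData`, but this
/// excerpt does not show its methods, so `words` is an assumption.
pub trait NgramData {
    fn words(&self) -> &'static [&'static str];
}

/// Empty marker structs, one per group. `EnglishData` is hypothetical.
pub struct EnglishData();
pub struct ProgrammingData();

impl NgramData for EnglishData {
    fn words(&self) -> &'static [&'static str] {
        &["the", "and", "that"] // placeholder data
    }
}

impl NgramData for ProgrammingData {
    fn words(&self) -> &'static [&'static str] {
        &["while", "with", "yield"] // placeholder data
    }
}

/// Hypothetical enum used to select a group at runtime.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NgramGroup {
    English,
    Programming,
}

impl NgramGroup {
    /// Map the enum variant to the struct that carries its data.
    pub fn data(self) -> Box<dyn NgramData> {
        match self {
            NgramGroup::English => Box::new(EnglishData()),
            NgramGroup::Programming => Box::new(ProgrammingData()),
        }
    }
}
```

With this shape, adding a group means adding one struct, one `impl`, and one enum variant; callers only ever see `NgramGroup` and the trait.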
What inspired me to adopt this app is that I am learning Colemak on a split ortho keyboard with 52 keys. My symbols are on a layer using QMK, so having a way to practice them and evaluate various layouts would be nice. But if this is the wrong tool, that is totally ok.
I used "group" instead of "language" because there may be things other than languages here, like programming keywords. It is very much an arbitrary decision, and I am open to suggestions for a clearer name.
use itertools::Itertools;

pub struct ProgrammingData();
I used the common data instead of making this Rust specific. People might use the tool and not know or care about the language. That means not all of the Rust keywords made the cut.
"this", "throw", "true", "try", "type", "typedef", "typeof", "union", "unsigned", "until",
"using", "var", "void", "volatile", "when", "where", "while", "with", "xor", "yield",
];
This is the "magic" that breaks the words up into ngrams. In theory this could be moved to mod.rs and exposed. Then it could be used to parse dictionaries the user provides instead of requiring them to have ngram lists.
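For readers without the full diff, the splitting step can be sketched roughly as follows. This is a reconstruction that assumes a sliding window over `char_indices`, consistent with the `.map(|(to, c)| ...)` fragment visible in the hunk; the function name `char_windows` and the zero-size guard are assumptions, not the PR's code.

```rust
/// Yield every `size`-character window of `source` as a borrowed slice.
/// The name and signature are hypothetical.
fn char_windows<'a>(source: &'a str, size: usize) -> impl Iterator<Item = &'a str> + 'a {
    source.char_indices().flat_map(move |(from, _)| {
        // A zero-sized window has no last character, so yield nothing.
        if size == 0 {
            return None;
        }
        // Find the last character of the window that starts at byte `from`.
        // `to` is that character's byte offset relative to `from`, so the
        // window ends at `from + to + c.len_utf8()`.
        source[from..]
            .char_indices()
            .nth(size - 1)
            .map(|(to, c)| &source[from..from + to + c.len_utf8()])
    })
}
```

For example, `char_windows("while", 3)` yields `"whi"`, `"hil"`, `"ile"`. Working with byte offsets from `char_indices` keeps the slices on character boundaries, so multi-byte characters are not split.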
.map(|(to, c)| &source[from..from + to + c.len_utf8()])
})
}
All of the methods below use the itertools `unique` method to ensure there are no duplicate entries.
Also `map(String::from)` instead of `map(|s| s.to_string())`.
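The two points above (dropping duplicates, then converting with `String::from`) can be illustrated with a small std-only sketch. It uses a `HashSet` as a stand-in for the itertools `unique` adapter so that it compiles without external crates; the function name `unique_owned` is hypothetical, and the PR itself uses itertools.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each item, in order, and return owned Strings.
/// Std-only stand-in for `.unique().map(String::from)` with itertools.
fn unique_owned<'a>(items: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // `insert` returns false when the value was already present.
        .filter(|s| seen.insert(*s))
        // Passing the function path avoids a closure like `|s| s.to_string()`.
        .map(String::from)
        .collect()
}
```

Both forms of the `map` call produce the same `String`s; the function-path form is simply shorter.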
Looks interesting, I'll take a closer look at this soon. I'm pretty busy at the moment so it might be a while until I can find the time!
All good. No rush.
Refactored ngrams.rs into a module directory to support other groups of ngrams. Added 'programming' as the first new group; it comprises the top 100 keywords among popular programming languages. To make this easier I defined a trait called `NgramData` and some matching enums and empty structs.
Yes, there are overlaps with the standard ngrams for English. I had two thoughts here.
Together, that means this is something of an aspirational PR. Might be worth applying it now and then iterating on improvements until the final state is reached?
Also, this PR has functions to break words into ngrams of size N. Assuming the logic is correct, that could also make it easier for people to use word files: they can provide a dictionary, ask for trigrams, and let the code parse it out.
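To make the dictionary idea concrete, here is a minimal sketch of turning a whitespace-separated word list into unique ngrams. It is a simplified illustration, not the PR's implementation: the function name `ngrams_from_dictionary` is hypothetical, and it collects characters into a `Vec` and uses `slice::windows` rather than the borrowing approach in the PR.

```rust
/// Split `dictionary` on whitespace and return each distinct `n`-character
/// ngram in first-seen order. Hypothetical helper for illustration only.
fn ngrams_from_dictionary(dictionary: &str, n: usize) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    if n == 0 {
        return out; // `windows(0)` would panic, so bail out early
    }
    for word in dictionary.split_whitespace() {
        // Work on chars so multi-byte characters stay intact.
        let chars: Vec<char> = word.chars().collect();
        for window in chars.windows(n) {
            let gram: String = window.iter().collect();
            // Linear scan is fine for a sketch; a HashSet would scale better.
            if !out.contains(&gram) {
                out.push(gram);
            }
        }
    }
    out
}
```

Calling `ngrams_from_dictionary("the then", 2)` returns `["th", "he", "en"]`; words shorter than `n` contribute nothing.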