-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider capitalization #162
base: dev
Are you sure you want to change the base?
Conversation
Oh, I based of master instead of dev. Should I do this again? EDIT: I rebased. |
1daa536
to
f61b04d
Compare
Pull Request Test Coverage Report for Build 741
💛 - Coveralls |
Pull Request Test Coverage Report for Build 655
💛 - Coveralls |
The changes look good! However I feel that this procedure should be available to non-classifier disambiguation as well and hence lie a layer above the current changes. Further I would prefer to have systematic changes (i.e. swapping all letters or searching for the unit with least differences in capitalization) rather than swapping only the last letter which seems quite random and inconsistent to me (what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?) |
Please keep changes unrelated to the specific issue out of this pull request. Seperate codestyle improving pull requests will be warmly welcomed |
OK, I created #164 for all my import sorting and there was a block not black formatted. I have moved the logic one layer up and also implemented it for no classifier. There is no need to consider any other case than the last letter (unit letter) to swap the case of, because it is only the first letter that causes the ambiguity because of the ambiguity in casing of the SI-prefixes. (milli and Mega, and pico and Peta) I did check if the unit was longer than 2 letters and disabled that additional case swapping for those. |
If you really only want to apply this check to prefixed units, we should check whether the unit has any SI-prefix and only apply this workaround in that case. On the other hand, the two letter check is too specific, consider for example the unit Bar. If someone writes "1 MBAR" it is currently either a milli- or a Mega-Bar, but according to this change it should be considered a Megabar preferably (on my device, rather deterministically, millibar is returned) |
Good suggestions. I applied them. |
Hmmm.. There is no Mbar in units.json, only a mbar and a μbar. So, it won't work anyway, but at least the code for it is now in place. |
I think this should be fixed then since the goal would be to support all Wikipedia featured units 😅 To fix the reduction in coverage, you would need to add a test case that triggers the branch you just added. |
Your test suite is a bit weird. Automatically loading a bunch of test cases is clever, but you are better off designing the tests yourself so that you can have tests that are specific to certain edge cases. Or at least have those in addition to the automated tests. Now for example when one loaded test fails, all are flagged as failed. When a test fails you want to know what the specific problem is. In addition to that as a tip, if you load multiple cases and test in a loop, at least use subtests: https://docs.python.org/3/library/unittest.html#distinguishing-test-iterations-using-subtests Anyway, without any proper example on how to write a test for an edge case from your side it is very hard for me to entangle your code to write a test case that works. I attempted one, but I cannot create a Unit that is equal somehow. Can you take a look what I am doing wrong? |
I see your point about the test suite. Thanks for pointing me to the subtest system. I will have a look into it. Related to #157 there will likely be also a class of test cases that merely checks whether an error is thrown or not but ignores the outcome. Regarding your issue, I will have a look soon but am currently rather busy. Until now I've had trouble to find the cause of test failures myself. |
OK, ping me when you updated the test suite, than I can see how I can learn from that to make tests for this PR. |
I enhanced the test suite, feel free to have a look at the changes of #171 |
The test suite still misses a simple single test like I would want to add as well. It still all obscure automated testing, now with subtests though. To repeat what I said before, a proper test suite would test specific different cases. It is not good enough to just go grab a random bunch of cases and test them all and if you succeed more than 80 % call it good enough. With this PR I want to test specific cases, but without an example I cannot get it going so easily. I tried, if you look at this PR I have a test, but it fails, and since I don't have an example to copy from I don't know what I am doing wrong. I will paste it here in the comment as well for you to see: So what is wrong here? class ClassifierTestEdgeCases(unittest.TestCase):
"""Test suite that tests certain specifically designed edge cases to confuse the classifier."""
def test_wrong_capitalization(self):
parsed_unit = p.parse("1nw")
unit = Quantity(
value=1,
unit=Unit(
name="nanowatt",
entity=Entity("power"),
dimensions=[{"base": "nanowatt", "power": 1, "surface": "nW"}],
),
)
self.assertEqual(parsed_unit, unit) PS. The steps under contributing are not complete, because the test suite requires some more requirements. |
Note that there are two distinct classes of test cases. One class contains ambiguous cases, one class contains cases that (given the rules provided) should always output a certain output. For the first class, only a percentage of classes is required to be recognized correctly. For the second class the output has to match correctly. Your change affects the classifier (ambiguous cases) but it is possible to create cases where the output is deterministic. Thereforce the test case should be added to the latter class. It is simply added by extending "tests/quantities.json" with a sample sentence (in your case for example "The energy in this battery is 1nW") and the expected units parsed (see the other examples in the file). I hope that helps. |
Yes, but the problem of that approach is that those tests are not isolated. (1) I cannot run a single test that gives problems. (2) I have to run the whole suite every time. (3) I don't see what specific test case is testing. It is not a bad way to run your tests like that, but having specific designed tests that test a specific feature of the code would help a lot. Here for example I specifically wrote a test that does So, how can I run a single test like the way I showed in my previous comment. (Also if you stick to the default way of writing tests people can add their own tests. It is now unclear how your test suite works.) |
The way I personally handle this is to create a file (i.e. "test.py") and write in it the desired test case behavior (i.e. parser.parse("...") == [...]). A number of IDEs support running this in debug mode as does Then if everything works as expected I add the required test to the json file. I agree that this is not optimal procedure. However with the sheer amount of test cases and possible interactions of rules it seems to me that the checking cannot be nailed down to a specific rule and has to be considered holistically. |
OK, well I don't think that the best way to do it, but I think I already made my point. It would be a good idea to make some sort of contributing doc in which you explain how somebody contributing can add a test case. I have added one test case. The other cases discussed here cannot be added since units.json misses things like MegaBar and MegaLitre. |
I am not so sure what is going on with "1 '2". What triggers that to be parsed? Anyway, I wrote a work around. Also, just a tip, don't use |
@nielstron Would you still be able to take a look at this? |
return None | ||
|
||
|
||
def get_a_better_one(old, new): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method name is really not saying anything. I also don't get what it does - Why is the length of units important? Is there not a more concise way to filter out None (potentially at the source of creating it)?
base = no_clf.disambiguate_no_classifier(base, text, lang) | ||
elif len(base) == 1: | ||
base = next(iter(base)) | ||
# If the unit is longer than two prefixes, we set everything to lower |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole following flow is convoluted and not concise. I don't really see how there is not a ton of issues here.
- If a unit has no prefix but starts with "m" (like meter) this messes up stuff
- Why do we selectively lower/swapcase certain parts of the surface? Should we not try out all possible capitalizations?
- Why set everything to lower for units longer than two letters (nothing known about prefixes)?
In general, this whole flow needs more explanation like "units that have more than two letters are most likely case insensitive. Only the capitalization of the prefix matters. We therefore only try out lowering the last letters"
@@ -1373,5 +1373,15 @@ | |||
"surface": "three million, two hundred & forty" | |||
} | |||
] | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need way more test cases to check if this (rather big) change in the flow works as expected. Please include one test case for every edge case you have considered (i.e. more than two letters, metric prefix, no metric prefix, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be nice, but that would mean that units.json needs to be updated. It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.
It is weird that some units don't have all prefixes, and some units have more prefixes defined than others. Is it really a good idea to hardcode what prefixes belong to what units? Why don't have all prefixes available to all units?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't have all prefixes available to all units?
Because one of the criteria for a unit to be listed is for it to be common.
In our case, that is defined by having a Wikipedia page (or being redirected to it).
This is not the case for all units (Take for example "Deci-tonne").
If you find a prefix-unit combination that has a Wikipedia page and is missing from units.json, feel free to add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.
Megalitre
is correctly recognized by quantulum. Mlitre
and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre" (Please post one here as a comment). Adding <prefix><surface>
is then trivial by manipulating the load method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about mbar
and Mbar
? Please add test cases for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the case for all units (Take for example "Deci-tonne").
Yes, because a tonne is a different kind of unit. It is just MegaGram. We don't have to support Tonne -> MG conversion, that is clearly a non-goal, but we can support full prefix for units for which this is generally expected, such as SI units. If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this. (But I think a lot of people expect all SI units to just work with all prefixes.)
Megalitre is correctly recognized by quantulum. Mlitre and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre"
Sure, but you have asked me in the past to support such "nonsensical" cases, so that's why I added support for that and wanted to add test cases for that:
(what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)
.... sensible body of text that uses "MLitre" (Please post one here as a comment)
Well apparently Mlitre isn't as nonsensical as it me seems, here you have it:
Secondary sewage from the Hyperion works is now used in a variety of ways to alleviate this. In one application, commissioned in 1997, Memcor processes are used in 15 Mlitre/day CMF/RO system to supply the nearby Mobil and Chevron refineries with feed for their boiler water plants. In an earlier project Memcor supplied an 11.5 Mlitre/day CMF plant as pretreatment for RO to provide water used for injection in a barrier scheme to keep out further sea water.
Source: https://www.climate-policy-watcher.org/water-quality/v-1.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will look into that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the issue of recognizing units like "Mlitre" to #183
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this.
The list of expectations for supported units is pretty clear in my opinion. But it may be considered to add options like "all_SI" or "all_Bin" for adding a certain set of prefixes to a unit. I moved this to #184
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I will probably work on improving the flow for this PR when we I can also get these test cases available (because, yes, the flow is sometimes quite chaotic). Ping me when it gets ready.
In case of ambiguous matches, this will re-attempt the algo by swapping the case of the last letter of the unit, and see if this creates a better match. It will also throw a warning if an unambiguous match cannot be found.
EDIT: Fixes #161