Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider capitalization #162

Open
wants to merge 20 commits into
base: dev
Choose a base branch
from

Conversation

hwalinga
Copy link
Contributor

@hwalinga hwalinga commented Sep 16, 2020

In case of ambiguous matches, this will re-attempt the algo by swapping the case of the last letter of the unit, and see if this creates a better match. It will also throw a warning if an unambiguous match cannot be found.

EDIT: Fixes #161

@hwalinga
Copy link
Contributor Author

hwalinga commented Sep 16, 2020

Oh, I based of master instead of dev. Should I do this again?

EDIT: I rebased.

@hwalinga hwalinga force-pushed the consider-capitalization branch from 1daa536 to f61b04d Compare September 16, 2020 19:16
@coveralls
Copy link

coveralls commented Sep 16, 2020

Pull Request Test Coverage Report for Build 741

  • 58 of 70 (82.86%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.8%) to 96.708%

Changes Missing Coverage Covered Lines Changed/Added Lines %
quantulum3/no_classifier.py 8 9 88.89%
quantulum3/disambiguate.py 33 44 75.0%
Totals Coverage Status
Change from base Build 734: -0.8%
Covered Lines: 1263
Relevant Lines: 1306

💛 - Coveralls

@coveralls
Copy link

Pull Request Test Coverage Report for Build 655

  • 29 of 32 (90.63%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 96.984%

Changes Missing Coverage Covered Lines Changed/Added Lines %
quantulum3/classifier.py 29 32 90.63%
Totals Coverage Status
Change from base Build 654: -0.2%
Covered Lines: 1222
Relevant Lines: 1260

💛 - Coveralls

@nielstron
Copy link
Owner

The changes look good! However I feel that this procedure should be available to non-classifier disambiguation as well and hence lie a layer above the current changes.

Further I would prefer to have systematic changes (i.e. swapping all letters or searching for the unit with least differences in capitalization) rather than swapping only the last letter which seems quite random and inconsistent to me (what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)

@nielstron
Copy link
Owner

Please keep changes unrelated to the specific issue out of this pull request. Seperate codestyle improving pull requests will be warmly welcomed

@hwalinga
Copy link
Contributor Author

OK,

I created #164 for all my import sorting and there was a block not black formatted.

I have moved the logic one layer up and also implemented it for no classifier.

There is no need to consider any other case than the last letter (unit letter) to swap the case of, because it is only the first letter that causes the ambiguity because of the ambiguity in casing of the SI-prefixes. (milli and Mega, and pico and Peta)

I did check if the unit was longer than 2 letters and disabled that additional case swapping for those.

@nielstron
Copy link
Owner

nielstron commented Sep 16, 2020

There is no need to consider any other case than the last letter (unit letter) to swap the case of, because it is only the first letter that causes the ambiguity because of the ambiguity in casing of the SI-prefixes. (milli and Mega, and pico and Peta)

If you really only want to apply this check to prefixed units, we should check whether the unit has any SI-prefix and only apply this workaround in that case.

On the other hand, the two letter check is too specific, consider for example the unit Bar. If someone writes "1 MBAR" it is currently either a milli- or a Mega-Bar, but according to this change it should be considered a Megabar preferably (on my device, rather deterministically, millibar is returned)

@hwalinga
Copy link
Contributor Author

Good suggestions. I applied them.

@hwalinga
Copy link
Contributor Author

hwalinga commented Sep 16, 2020

Hmmm.. There is no Mbar in units.json, only a mbar and a μbar. So, it won't work anyway, but at least the code for it is now in place.

@nielstron
Copy link
Owner

nielstron commented Sep 24, 2020

There is no Mbar in units.json

I think this should be fixed then since the goal would be to support all Wikipedia featured units 😅

To fix the reduction in coverage, you would need to add a test case that triggers the branch you just added.

@hwalinga
Copy link
Contributor Author

Your test suite is a bit weird. Automatically loading a bunch of test cases is clever, but you are better off designing the tests yourself so that you can have tests that are specific to certain edge cases. Or at least have those in addition to the automated tests. Now for example when one loaded test fails, all are flagged as failed. When a test fails you want to know what the specific problem is.

In addition to that as a tip, if you load multiple cases and test in a loop, at least use subtests:

https://docs.python.org/3/library/unittest.html#distinguishing-test-iterations-using-subtests

Anyway, without any proper example on how to write a test for an edge case from your side it is very hard for me to entangle your code to write a test case that works. I attempted one, but I cannot create a Unit that is equal somehow. Can you take a look what I am doing wrong?

@nielstron
Copy link
Owner

I see your point about the test suite. Thanks for pointing me to the subtest system. I will have a look into it. Related to #157 there will likely be also a class of test cases that merely checks whether an error is thrown or not but ignores the outcome.

Regarding your issue, I will have a look soon but am currently rather busy. Until now I've had trouble to find the cause of test failures myself.

@nielstron nielstron mentioned this pull request Sep 30, 2020
2 tasks
@hwalinga
Copy link
Contributor Author

hwalinga commented Oct 1, 2020

OK, ping me when you updated the test suite, than I can see how I can learn from that to make tests for this PR.

@nielstron
Copy link
Owner

I enhanced the test suite, feel free to have a look at the changes of #171

@hwalinga
Copy link
Contributor Author

The test suite still misses a simple single test like I would want to add as well. It still all obscure automated testing, now with subtests though.

To repeat what I said before, a proper test suite would test specific different cases. It is not good enough to just go grab a random bunch of cases and test them all and if you succeed more than 80 % call it good enough.

With this PR I want to test specific cases, but without an example I cannot get it going so easily. I tried, if you look at this PR I have a test, but it fails, and since I don't have an example to copy from I don't know what I am doing wrong. I will paste it here in the comment as well for you to see:

So what is wrong here?

class ClassifierTestEdgeCases(unittest.TestCase):
    """Test suite that tests certain specifically designed edge cases to confuse the classifier."""

    def test_wrong_capitalization(self):
        parsed_unit = p.parse("1nw")
        unit = Quantity(
            value=1,
            unit=Unit(
                name="nanowatt",
                entity=Entity("power"),
                dimensions=[{"base": "nanowatt", "power": 1, "surface": "nW"}],
            ),
        )
        self.assertEqual(parsed_unit, unit)

PS. The steps under contributing are not complete, because the test suite requires some more requirements.

@nielstron
Copy link
Owner

Note that there are two distinct classes of test cases. One class contains ambiguous cases, one class contains cases that (given the rules provided) should always output a certain output. For the first class, only a percentage of classes is required to be recognized correctly. For the second class the output has to match correctly.

Your change affects the classifier (ambiguous cases) but it is possible to create cases where the output is deterministic. Thereforce the test case should be added to the latter class. It is simply added by extending "tests/quantities.json" with a sample sentence (in your case for example "The energy in this battery is 1nW") and the expected units parsed (see the other examples in the file). I hope that helps.

@hwalinga
Copy link
Contributor Author

Yes, but the problem of that approach is that those tests are not isolated. (1) I cannot run a single test that gives problems. (2) I have to run the whole suite every time. (3) I don't see what specific test case is testing.

It is not a bad way to run your tests like that, but having specific designed tests that test a specific feature of the code would help a lot. Here for example I specifically wrote a test that does test_wrong_capitalization. If I put that in that json all the context and isolation is gone.

So, how can I run a single test like the way I showed in my previous comment.

(Also if you stick to the default way of writing tests people can add their own tests. It is now unclear how your test suite works.)

@nielstron
Copy link
Owner

The way I personally handle this is to create a file (i.e. "test.py") and write in it the desired test case behavior (i.e. parser.parse("...") == [...]). A number of IDEs support running this in debug mode as does pdb

Then if everything works as expected I add the required test to the json file.

I agree that this is not optimal procedure. However with the sheer amount of test cases and possible interactions of rules it seems to me that the checking cannot be nailed down to a specific rule and has to be considered holistically.

@hwalinga
Copy link
Contributor Author

OK, well I don't think that the best way to do it, but I think I already made my point. It would be a good idea to make some sort of contributing doc in which you explain how somebody contributing can add a test case.

I have added one test case. The other cases discussed here cannot be added since units.json misses things like MegaBar and MegaLitre.

@hwalinga
Copy link
Contributor Author

I am not so sure what is going on with "1 '2". What triggers that to be parsed? Anyway, I wrote a work around.

Also, just a tip, don't use try: ... except Exception as e: self.fail(...) in a test case. You lose the stacktrace during debugging. (pylint does not flag that as wrong for nothing.)

@hwalinga
Copy link
Contributor Author

@nielstron Would you still be able to take a look at this?

return None


def get_a_better_one(old, new):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name is really not saying anything. I also don't get what it does - Why is the length of units important? Is there not a more concise way to filter out None (potentially at the source of creating it)?

base = no_clf.disambiguate_no_classifier(base, text, lang)
elif len(base) == 1:
base = next(iter(base))
# If the unit is longer than two prefixes, we set everything to lower
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole following flow is convoluted and not concise. I don't really see how there is not a ton of issues here.

  • If a unit has no prefix but starts with "m" (like meter) this messes up stuff
  • Why do we selectively lower/swapcase certain parts of the surface? Should we not try out all possible capitalizations?
  • Why set everything to lower for units longer than two letters (nothing known about prefixes)?

In general, this whole flow needs more explanation like "units that have more than two letters are most likely case insensitive. Only the capitalization of the prefix matters. We therefore only try out lowering the last letters"

@@ -1373,5 +1373,15 @@
"surface": "three million, two hundred & forty"
}
]
},
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need way more test cases to check if this (rather big) change in the flow works as expected. Please include one test case for every edge case you have considered (i.e. more than two letters, metric prefix, no metric prefix, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be nice, but that would mean that units.json needs to be updated. It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.

It is weird that some units don't have all prefixes, and some units have more prefixes defined than others. Is it really a good idea to hardcode what prefixes belong to what units? Why don't have all prefixes available to all units?

Copy link
Owner

@nielstron nielstron Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't have all prefixes available to all units?

Because one of the criteria for a unit to be listed is for it to be common.
In our case, that is defined by having a Wikipedia page (or being redirected to it).
This is not the case for all units (Take for example "Deci-tonne").
If you find a prefix-unit combination that has a Wikipedia page and is missing from units.json, feel free to add it.

Copy link
Owner

@nielstron nielstron Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.

Megalitre is correctly recognized by quantulum. Mlitre and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre" (Please post one here as a comment). Adding <prefix><surface> is then trivial by manipulating the load method

Copy link
Owner

@nielstron nielstron Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about mbar and Mbar? Please add test cases for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the case for all units (Take for example "Deci-tonne").

Yes, because a tonne is a different kind of unit. It is just MegaGram. We don't have to support Tonne -> MG conversion, that is clearly a non-goal, but we can support full prefix for units for which this is generally expected, such as SI units. If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this. (But I think a lot of people expect all SI units to just work with all prefixes.)

Megalitre is correctly recognized by quantulum. Mlitre and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre"

Sure, but you have asked me in the past to support such "nonsensical" cases, so that's why I added support for that and wanted to add test cases for that:

(what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)

.... sensible body of text that uses "MLitre" (Please post one here as a comment)

Well apparently Mlitre isn't as nonsensical as it me seems, here you have it:

Secondary sewage from the Hyperion works is now used in a variety of ways to alleviate this. In one application, commissioned in 1997, Memcor processes are used in 15 Mlitre/day CMF/RO system to supply the nearby Mobil and Chevron refineries with feed for their boiler water plants. In an earlier project Memcor supplied an 11.5 Mlitre/day CMF plant as pretreatment for RO to provide water used for injection in a barrier scheme to keep out further sea water.

Source: https://www.climate-policy-watcher.org/water-quality/v-1.html

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will look into that :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the issue of recognizing units like "Mlitre" to #183

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this.

The list of expectations for supported units is pretty clear in my opinion. But it may be considered to add options like "all_SI" or "all_Bin" for adding a certain set of prefixes to a unit. I moved this to #184

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I will probably work on improving the flow for this PR when we I can also get these test cases available (because, yes, the flow is sometimes quite chaotic). Ping me when it gets ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quantulum incorrect for prefixes that are not capitilized if the unit is incorrectly capitilized
3 participants