Consider capitalization #162

hwalinga · 2020-09-16T19:10:05Z

In case of ambiguous matches, this will re-attempt the algo by swapping the case of the last letter of the unit, and see if this creates a better match. It will also throw a warning if an unambiguous match cannot be found.

EDIT: Fixes #161

hwalinga · 2020-09-16T19:11:19Z

Oh, I based of master instead of dev. Should I do this again?

EDIT: I rebased.

…Fixes nielstron#161.

coveralls · 2020-09-16T19:21:08Z

Pull Request Test Coverage Report for Build 741

58 of 70 (82.86%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.8%) to 96.708%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
quantulum3/no_classifier.py	8	9	88.89%
quantulum3/disambiguate.py	33	44	75.0%

Totals
Change from base Build 734:	-0.8%
Covered Lines:	1263
Relevant Lines:	1306

💛 - Coveralls

coveralls · 2020-09-16T19:21:08Z

Pull Request Test Coverage Report for Build 655

29 of 32 (90.63%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.2%) to 96.984%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
quantulum3/classifier.py	29	32	90.63%

Totals
Change from base Build 654:	-0.2%
Covered Lines:	1222
Relevant Lines:	1260

💛 - Coveralls

nielstron · 2020-09-16T20:26:13Z

The changes look good! However I feel that this procedure should be available to non-classifier disambiguation as well and hence lie a layer above the current changes.

Further I would prefer to have systematic changes (i.e. swapping all letters or searching for the unit with least differences in capitalization) rather than swapping only the last letter which seems quite random and inconsistent to me (what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)

nielstron · 2020-09-16T20:27:17Z

Please keep changes unrelated to the specific issue out of this pull request. Seperate codestyle improving pull requests will be warmly welcomed

…on#163.

hwalinga · 2020-09-16T22:10:56Z

OK,

I created #164 for all my import sorting and there was a block not black formatted.

I have moved the logic one layer up and also implemented it for no classifier.

There is no need to consider any other case than the last letter (unit letter) to swap the case of, because it is only the first letter that causes the ambiguity because of the ambiguity in casing of the SI-prefixes. (milli and Mega, and pico and Peta)

I did check if the unit was longer than 2 letters and disabled that additional case swapping for those.

nielstron · 2020-09-16T22:44:48Z

There is no need to consider any other case than the last letter (unit letter) to swap the case of, because it is only the first letter that causes the ambiguity because of the ambiguity in casing of the SI-prefixes. (milli and Mega, and pico and Peta)

If you really only want to apply this check to prefixed units, we should check whether the unit has any SI-prefix and only apply this workaround in that case.

On the other hand, the two letter check is too specific, consider for example the unit Bar. If someone writes "1 MBAR" it is currently either a milli- or a Mega-Bar, but according to this change it should be considered a Megabar preferably (on my device, rather deterministically, millibar is returned)

hwalinga · 2020-09-16T22:57:54Z

Good suggestions. I applied them.

hwalinga · 2020-09-16T23:25:08Z

Hmmm.. There is no Mbar in units.json, only a mbar and a μbar. So, it won't work anyway, but at least the code for it is now in place.

nielstron · 2020-09-24T18:23:26Z

There is no Mbar in units.json

I think this should be fixed then since the goal would be to support all Wikipedia featured units 😅

To fix the reduction in coverage, you would need to add a test case that triggers the branch you just added.

hwalinga · 2020-09-25T09:08:30Z

Your test suite is a bit weird. Automatically loading a bunch of test cases is clever, but you are better off designing the tests yourself so that you can have tests that are specific to certain edge cases. Or at least have those in addition to the automated tests. Now for example when one loaded test fails, all are flagged as failed. When a test fails you want to know what the specific problem is.

In addition to that as a tip, if you load multiple cases and test in a loop, at least use subtests:

https://docs.python.org/3/library/unittest.html#distinguishing-test-iterations-using-subtests

Anyway, without any proper example on how to write a test for an edge case from your side it is very hard for me to entangle your code to write a test case that works. I attempted one, but I cannot create a Unit that is equal somehow. Can you take a look what I am doing wrong?

nielstron · 2020-09-30T21:46:21Z

I see your point about the test suite. Thanks for pointing me to the subtest system. I will have a look into it. Related to #157 there will likely be also a class of test cases that merely checks whether an error is thrown or not but ignores the outcome.

Regarding your issue, I will have a look soon but am currently rather busy. Until now I've had trouble to find the cause of test failures myself.

hwalinga · 2020-10-01T16:01:38Z

OK, ping me when you updated the test suite, than I can see how I can learn from that to make tests for this PR.

nielstron · 2020-10-19T22:22:38Z

I enhanced the test suite, feel free to have a look at the changes of #171

hwalinga · 2020-10-24T20:03:29Z

The test suite still misses a simple single test like I would want to add as well. It still all obscure automated testing, now with subtests though.

To repeat what I said before, a proper test suite would test specific different cases. It is not good enough to just go grab a random bunch of cases and test them all and if you succeed more than 80 % call it good enough.

With this PR I want to test specific cases, but without an example I cannot get it going so easily. I tried, if you look at this PR I have a test, but it fails, and since I don't have an example to copy from I don't know what I am doing wrong. I will paste it here in the comment as well for you to see:

So what is wrong here?

class ClassifierTestEdgeCases(unittest.TestCase):
    """Test suite that tests certain specifically designed edge cases to confuse the classifier."""

    def test_wrong_capitalization(self):
        parsed_unit = p.parse("1nw")
        unit = Quantity(
            value=1,
            unit=Unit(
                name="nanowatt",
                entity=Entity("power"),
                dimensions=[{"base": "nanowatt", "power": 1, "surface": "nW"}],
            ),
        )
        self.assertEqual(parsed_unit, unit)

PS. The steps under contributing are not complete, because the test suite requires some more requirements.

nielstron · 2020-10-24T22:00:06Z

Note that there are two distinct classes of test cases. One class contains ambiguous cases, one class contains cases that (given the rules provided) should always output a certain output. For the first class, only a percentage of classes is required to be recognized correctly. For the second class the output has to match correctly.

Your change affects the classifier (ambiguous cases) but it is possible to create cases where the output is deterministic. Thereforce the test case should be added to the latter class. It is simply added by extending "tests/quantities.json" with a sample sentence (in your case for example "The energy in this battery is 1nW") and the expected units parsed (see the other examples in the file). I hope that helps.

hwalinga · 2020-10-24T22:17:56Z

Yes, but the problem of that approach is that those tests are not isolated. (1) I cannot run a single test that gives problems. (2) I have to run the whole suite every time. (3) I don't see what specific test case is testing.

It is not a bad way to run your tests like that, but having specific designed tests that test a specific feature of the code would help a lot. Here for example I specifically wrote a test that does test_wrong_capitalization. If I put that in that json all the context and isolation is gone.

So, how can I run a single test like the way I showed in my previous comment.

(Also if you stick to the default way of writing tests people can add their own tests. It is now unclear how your test suite works.)

nielstron · 2020-10-25T04:50:26Z

The way I personally handle this is to create a file (i.e. "test.py") and write in it the desired test case behavior (i.e. parser.parse("...") == [...]). A number of IDEs support running this in debug mode as does pdb

Then if everything works as expected I add the required test to the json file.

I agree that this is not optimal procedure. However with the sheer amount of test cases and possible interactions of rules it seems to me that the checking cannot be nailed down to a specific rule and has to be considered holistically.

hwalinga · 2020-10-27T13:05:20Z

OK, well I don't think that the best way to do it, but I think I already made my point. It would be a good idea to make some sort of contributing doc in which you explain how somebody contributing can add a test case.

I have added one test case. The other cases discussed here cannot be added since units.json misses things like MegaBar and MegaLitre.

hwalinga · 2020-10-27T13:26:44Z

I am not so sure what is going on with "1 '2". What triggers that to be parsed? Anyway, I wrote a work around.

Also, just a tip, don't use try: ... except Exception as e: self.fail(...) in a test case. You lose the stacktrace during debugging. (pylint does not flag that as wrong for nothing.)

hwalinga · 2020-11-30T20:00:52Z

@nielstron Would you still be able to take a look at this?

nielstron · 2020-12-01T09:07:17Z

quantulum3/disambiguate.py

+        return None
+
+
+def get_a_better_one(old, new):


This method name is really not saying anything. I also don't get what it does - Why is the length of units important? Is there not a more concise way to filter out None (potentially at the source of creating it)?

nielstron · 2020-12-01T09:11:24Z

quantulum3/disambiguate.py

-            base = no_clf.disambiguate_no_classifier(base, text, lang)
-        elif len(base) == 1:
-            base = next(iter(base))
+    # If the unit is longer than two prefixes, we set everything to lower


The whole following flow is convoluted and not concise. I don't really see how there is not a ton of issues here.

If a unit has no prefix but starts with "m" (like meter) this messes up stuff

Why do we selectively lower/swapcase certain parts of the surface? Should we not try out all possible capitalizations?

Why set everything to lower for units longer than two letters (nothing known about prefixes)?

In general, this whole flow needs more explanation like "units that have more than two letters are most likely case insensitive. Only the capitalization of the prefix matters. We therefore only try out lowering the last letters"

nielstron · 2020-12-01T09:13:15Z

quantulum3/_lang/en_US/tests/quantities.json

@@ -1373,5 +1373,15 @@
          "surface": "three million, two hundred & forty"
        }
      ]
+  },


We probably need way more test cases to check if this (rather big) change in the flow works as expected. Please include one test case for every edge case you have considered (i.e. more than two letters, metric prefix, no metric prefix, ...)

Yes, that would be nice, but that would mean that units.json needs to be updated. It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.

It is weird that some units don't have all prefixes, and some units have more prefixes defined than others. Is it really a good idea to hardcode what prefixes belong to what units? Why don't have all prefixes available to all units?

Why don't have all prefixes available to all units?

Because one of the criteria for a unit to be listed is for it to be common.
In our case, that is defined by having a Wikipedia page (or being redirected to it).
This is not the case for all units (Take for example "Deci-tonne").
If you find a prefix-unit combination that has a Wikipedia page and is missing from units.json, feel free to add it.

It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.

Megalitre is correctly recognized by quantulum. Mlitre and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre" (Please post one here as a comment). Adding <prefix><surface> is then trivial by manipulating the load method

What about mbar and Mbar? Please add test cases for them.

This is not the case for all units (Take for example "Deci-tonne").

Yes, because a tonne is a different kind of unit. It is just MegaGram. We don't have to support Tonne -> MG conversion, that is clearly a non-goal, but we can support full prefix for units for which this is generally expected, such as SI units. If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this. (But I think a lot of people expect all SI units to just work with all prefixes.)

Megalitre is correctly recognized by quantulum. Mlitre and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre"

Sure, but you have asked me in the past to support such "nonsensical" cases, so that's why I added support for that and wanted to add test cases for that:

(what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)

.... sensible body of text that uses "MLitre" (Please post one here as a comment)

Well apparently Mlitre isn't as nonsensical as it me seems, here you have it:

Secondary sewage from the Hyperion works is now used in a variety of ways to alleviate this. In one application, commissioned in 1997, Memcor processes are used in 15 Mlitre/day CMF/RO system to supply the nearby Mobil and Chevron refineries with feed for their boiler water plants. In an earlier project Memcor supplied an 11.5 Mlitre/day CMF plant as pretreatment for RO to provide water used for injection in a barrier scheme to keep out further sea water.

Source: https://www.climate-policy-watcher.org/water-quality/v-1.html

Thanks, I will look into that :)

I moved the issue of recognizing units like "Mlitre" to #183

If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this.

The list of expectations for supported units is pretty clear in my opinion. But it may be considered to add options like "all_SI" or "all_Bin" for adding a certain set of prefixes to a unit. I moved this to #184

Sounds good! I will probably work on improving the flow for this PR when we I can also get these test cases available (because, yes, the flow is sometimes quite chaotic). Ping me when it gets ready.

Add code that guesses a different capitalization in case of ambiguity. …

f61b04d

…Fixes nielstron#161.

hwalinga force-pushed the consider-capitalization branch from 1daa536 to f61b04d Compare September 16, 2020 19:16

Fixed some unreachable log line.

e0b0596

nielstron mentioned this pull request Sep 16, 2020

Nondeterministic disambiguation #163

Open

hwalinga added 5 commits September 16, 2020 22:37

Slightly clearer what throws what now.

47fc9ff

Move capitalization logic one layer up.

368435b

Have determistic output when multiple things match. Addresses nielstr…

94a6afd

…on#163.

No classifier does not raise KeyError but resorts to 'unk' instead.

8095df2

Small reference error bug.

d11c1cd

hwalinga added 2 commits September 17, 2020 00:17

Removing notion of random.

17237f7

Sorting not supported on 'Unit'.

7ad9ecb

hwalinga added 2 commits September 17, 2020 00:50

Consider capitalization for units longer than 2 letters.

22e163c

Changed some conditions that altered capitalization work around.

336fae5

nielstron mentioned this pull request Sep 24, 2020

Add SI-derived units of Bar #165

Closed

Attempt in trying to add a test.

5cd0192

nielstron mentioned this pull request Sep 30, 2020

Improve test suite #167

Closed

2 tasks

hwalinga added 2 commits October 24, 2020 21:26

Resolved merge conflicts.

4567137

Solve merge conflict

ec38051

hwalinga added 2 commits October 27, 2020 13:48

Add test case.

4d70c3d

Added test case.

d387e79

hwalinga added 2 commits October 27, 2020 14:10

Remove old test case.

633a4b2

Solve strange edge case.

b941b15

hwalinga added 2 commits October 27, 2020 14:41

Pylint issues.

0e8cafa

Black reformat.

378ef5f

nielstron requested changes Dec 1, 2020

View reviewed changes

nielstron mentioned this pull request Dec 5, 2020

Add all prefixes of some subset of the prefixes to certain units #184

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider capitalization #162

Consider capitalization #162

hwalinga commented Sep 16, 2020 •

edited

Loading

hwalinga commented Sep 16, 2020 •

edited

Loading

coveralls commented Sep 16, 2020 •

edited

Loading

coveralls commented Sep 16, 2020

nielstron commented Sep 16, 2020

nielstron commented Sep 16, 2020

hwalinga commented Sep 16, 2020

nielstron commented Sep 16, 2020 •

edited

Loading

hwalinga commented Sep 16, 2020

hwalinga commented Sep 16, 2020 •

edited

Loading

nielstron commented Sep 24, 2020 •

edited

Loading

hwalinga commented Sep 25, 2020

nielstron commented Sep 30, 2020

hwalinga commented Oct 1, 2020

nielstron commented Oct 19, 2020

hwalinga commented Oct 24, 2020

nielstron commented Oct 24, 2020

hwalinga commented Oct 24, 2020

nielstron commented Oct 25, 2020

hwalinga commented Oct 27, 2020

hwalinga commented Oct 27, 2020

hwalinga commented Nov 30, 2020

nielstron Dec 1, 2020

nielstron Dec 1, 2020

nielstron Dec 1, 2020

hwalinga Dec 2, 2020

nielstron Dec 4, 2020 •

edited

Loading

nielstron Dec 4, 2020 •

edited

Loading

nielstron Dec 4, 2020 •

edited

Loading

hwalinga Dec 4, 2020

nielstron Dec 5, 2020

nielstron Dec 5, 2020

nielstron Dec 5, 2020

hwalinga Dec 5, 2020

Consider capitalization #162

Are you sure you want to change the base?

Consider capitalization #162

Conversation

hwalinga commented Sep 16, 2020 • edited Loading

hwalinga commented Sep 16, 2020 • edited Loading

coveralls commented Sep 16, 2020 • edited Loading

Pull Request Test Coverage Report for Build 741

💛 - Coveralls

coveralls commented Sep 16, 2020

Pull Request Test Coverage Report for Build 655

💛 - Coveralls

nielstron commented Sep 16, 2020

nielstron commented Sep 16, 2020

hwalinga commented Sep 16, 2020

nielstron commented Sep 16, 2020 • edited Loading

hwalinga commented Sep 16, 2020

hwalinga commented Sep 16, 2020 • edited Loading

nielstron commented Sep 24, 2020 • edited Loading

hwalinga commented Sep 25, 2020

nielstron commented Sep 30, 2020

hwalinga commented Oct 1, 2020

nielstron commented Oct 19, 2020

hwalinga commented Oct 24, 2020

nielstron commented Oct 24, 2020

hwalinga commented Oct 24, 2020

nielstron commented Oct 25, 2020

hwalinga commented Oct 27, 2020

hwalinga commented Oct 27, 2020

hwalinga commented Nov 30, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nielstron Dec 4, 2020 • edited Loading

Choose a reason for hiding this comment

nielstron Dec 4, 2020 • edited Loading

Choose a reason for hiding this comment

nielstron Dec 4, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hwalinga commented Sep 16, 2020 •

edited

Loading

hwalinga commented Sep 16, 2020 •

edited

Loading

coveralls commented Sep 16, 2020 •

edited

Loading

nielstron commented Sep 16, 2020 •

edited

Loading

hwalinga commented Sep 16, 2020 •

edited

Loading

nielstron commented Sep 24, 2020 •

edited

Loading

nielstron Dec 4, 2020 •

edited

Loading

nielstron Dec 4, 2020 •

edited

Loading

nielstron Dec 4, 2020 •

edited

Loading