Some quirks when parsing a general text... #135

Open
psychemedia opened this issue Aug 6, 2019 · 7 comments

@psychemedia

psychemedia commented Aug 6, 2019

I wrote a simple story and it threw up some interesting numbers...

from quantulum3 import parser

text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250. 
It was blue. It took forty five minutes to get it home. 
What a day that was. I didn't get back until 2.15pm. Then I had cake for tea.
'''

parser.inline_parse_and_expand(text)

returns:

"\nOnce upon one instance, there was a thing. The thing weighed forty kilograms and cost two hundred and fifty pounds sterling, zero pence. \nIt was blue. It took forty-five minutes to get it home. \nWhat one day that was. I didn't get back until two point one five picometres. Then I had cake for tea.\n"

and parser.parse(text) returns:

[Quantity(1, "Unit(name="count", entity=Entity("dimensionless"), uri=Count_data)"), Quantity(40, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)"), Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)"), Quantity(45, "Unit(name="minute of arc", entity=Entity("angle"), uri=Minute_and_second_of_arc)"), Quantity(1, "Unit(name="day", entity=Entity("time"), uri=Day)"), Quantity(2.15, "Unit(name="picometre", entity=Entity("length"), uri=Picometre)")]

@nielstron
Owner

nielstron commented Aug 7, 2019

There are some legitimate problems with the parse output of this text, thanks for the sample! I will look into some of these issues.

  • "a thing" results in "1 count", which is actually not that wrong...
  • pm/am are interpreted as pico-/attometres rather than time delimiters
  • "it took 45 minutes" is interpreted as minutes of arc

Disambiguation is not perfect yet as shown by the "minute of arc" interpretation. Still working on improving this...
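
Until the am/pm disambiguation improves, one possible workaround is to hide clock times from the parser before calling it. The following is only a sketch: `mask_clock_times`, `CLOCK_TIME` and the `<TIME>` placeholder are hypothetical names, not part of quantulum3, and it is an assumption that the parser ignores the placeholder.

```python
import re

# Hypothetical pre-processing helper; it is not part of quantulum3.
# Matches 12-hour clock times such as "2.15pm", "2:15 pm" or "11am".
CLOCK_TIME = re.compile(
    r"\b(?:1[0-2]|0?[1-9])(?:[.:][0-5]\d)?\s?[ap]m\b",
    re.IGNORECASE,
)


def mask_clock_times(text, placeholder="<TIME>"):
    """Replace clock times with a placeholder and return them separately."""
    times = []

    def _stash(match):
        # Remember the original time string so it can be handled elsewhere.
        times.append(match.group(0))
        return placeholder

    return CLOCK_TIME.sub(_stash, text), times


masked, times = mask_clock_times("I didn't get back until 2.15pm.")
print(masked)
print(times)
```

The masked text could then be passed to `parser.parse`. The pattern only accepts hours from 1 to 12, so "45 pm" is left alone, but a genuine length such as "3 pm" would still be masked as a time.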

@psychemedia
Author

psychemedia commented Aug 11, 2019

In passing, I also just spotted a natural language time parsing package, ctparse, but I've not had a chance to play with it yet.

@alberto-bracci

I had several similar issues. The weirdest was 'PayPal' being parsed into 'petayear year petayear litre'. Is there a way to force quantulum to stick to basic units and not try to guess these combinations? Or any way to change its behavior to suit my situation?

@nielstron
Owner

I agree, a parameter to disable parsing of non-space-separated combined units should be added. It might also help to allow passing a list of custom (application-specific) words that are not to be interpreted as units.
PRs addressing this are welcome, otherwise I might at some point find the time to implement this myself :)
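
In the meantime, an application-specific ignore list can be approximated outside the library by filtering the parse results. This is a rough sketch, not a quantulum3 feature: `drop_ignored` is a hypothetical helper, and it assumes each returned quantity has a `span` attribute holding `(start, end)` character offsets into the parsed text. The `FakeQuantity` objects only stand in for parser output, and their surfaces are illustrative.

```python
import re
from collections import namedtuple


def drop_ignored(quantities, text, ignored_words):
    """Drop quantities whose span overlaps an ignored word in the text.

    Assumes each quantity has a `span` attribute holding (start, end)
    character offsets into `text`.
    """
    # Collect the character ranges of every ignored word (case-insensitive).
    blocked = [
        match.span()
        for word in ignored_words
        for match in re.finditer(re.escape(word), text, re.IGNORECASE)
    ]

    def overlaps(span):
        return any(span[0] < end and start < span[1] for start, end in blocked)

    return [q for q in quantities if not overlaps(q.span)]


# Stand-in for parser output so the sketch runs without quantulum3 installed.
FakeQuantity = namedtuple("FakeQuantity", ["surface", "span"])

text = "I paid 20 PayPal credits for 3 kg of flour."
found = [
    FakeQuantity("20 PayPal", (7, 16)),
    FakeQuantity("3 kg", (29, 33)),
]
kept = drop_ignored(found, text, ["PayPal"])
print([q.surface for q in kept])
```

Filtering after the fact does not stop the parser from guessing the combination in the first place; it only discards results that touch a word the application knows is not a unit.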

@alberto-bracci

I'll see whether I can find the time to do it. On another note: is the only way to add custom units to edit the entities.json or units.json files, or is there a way to do it from Python?

@nielstron
Owner

Currently, editing those files is the easiest way without changing the source code of the project.
You can of course also add your own entities and units by manipulating the cached Entities and Units objects stored in the _CACHE_DICT in load.py.

@nielstron
Owner

@alberto-bracci with #186 there will be an option to add custom entities and units to quantulum3 without any hassle :) Sorry for the delay, but this required some reworking of the inner quantulum structure that was pending anyway.


3 participants