Question about performance and syntax #141
Comments
Backreferences are at this time incomplete and only possible for separate alternations e.g. I don't understand how |
I have this comment in
The optimal condition is never executed, because earlier PCRE2 versions produced an error. I tested with the latest PCRE2 10.37 and it appears to work now (at least in a quick test run). There is a big performance difference when many patterns match, and not much difference when few patterns match. Without the optimized condition, the time to return results with many matches is significantly longer:
With the optimized condition, this runs super fast as expected:
I hope we can enable the optimized if-condition branch, since the PCRE2 bug happened more than a year ago, off the top of my head. However, I need to test more to assess if this is indeed the case. |
Well, after checking a bit more, for some reason some patterns do not match when the optimized condition is enabled. There should not be any difference, because the two calls
returns nothing, when it should match everything (in this case just the word So strange and worrisome to see this happen with PCRE2. I need to dig deeper. |
I believe some progress has been made as all ugrep tests pass cleanly as of this evening, including the extensive RE/flex tests and additional edge cases. I don't remember why I could not get rid of the
Optimal parallel efficiency of 400% to search in 0.04s running (wall-clock) time, the same time as searching one file:
Nice! |
Updated to v3.3.4. |
Let me add that capturing groups and backreferences work fine with The plan is to use a much more efficient algorithm for capturing groups (and backreferences) compared to GNU grep. This is work in progress with the RE/flex library, see Genivia/RE-flex#95. On the other hand, I could "cheat" and use another regex library, but one that is less efficient than RE/flex. I am not sure I want to go there. What is the fun in that? Important things come first! Ugrep needs to be very reliable, efficient and equipped with the essential features we wanted, which seems to be achieved (or about) IMHO. I hope people will continue to report issues and deviations from GNU grep or questions, so we can be sure. Then it's time to tackle the few remaining items that require more R&D effort such as efficient capturing groups and backreferences. |
Hello.
I've been trying out ugrep with a simple match from STDIN on logs (around 600k lines in input).
There are a couple of things that are not clear to me.
First, in the POSIX pattern syntax there's support for captures but there's no way to use them, even as backreferences in the pattern, so why the support for capturing and non-capturing groups? Is it just a documentation problem?
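To make the distinction in this question concrete, here is a minimal sketch of what a capturing group with a backreference does compared to a non-capturing group. It uses Python's `re` module purely as an illustration; that is a Perl-like engine, not ugrep's POSIX matcher, and the patterns and the helper name `has_doubled_word` are assumptions made up for this example.

```python
import re

# Capturing group: \1 must repeat exactly the text that group 1 matched.
# (Illustrative pattern, not taken from the issue.)
doubled = re.compile(r"\b(\w+) \1\b")

# Non-capturing group: used only to group for repetition or alternation.
# Nothing is recorded, so there is no \1 to refer back to.
grouped = re.compile(r"(?:ab)+")


def has_doubled_word(line):
    """Return True if the line contains the same word twice in a row."""
    return doubled.search(line) is not None
```

A backreference forces the matcher to remember the captured text while it is still matching, which is why it tends to be harder to support efficiently than plain grouping.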
Second, about performance.
I've noticed that either grep -P or pcregrep are noticeably faster than ugrep (in my case 0.75s and 0.4s for the first two vs 1.1s for ugrep, with ~14s! using ugrep -P).
Although I like the feature set of ugrep a lot, the difference in performance is rather worrying, especially considering that my use case would be to look for matches across several GBs of compressed log files.
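For anyone wanting to reproduce a comparison like this, a rough timing harness could look like the sketch below. It is not how the numbers above were measured; the function name `best_wall_time`, the pattern, and the log file name in the comments are all hypothetical. It feeds a file to a command on STDIN, discards the output, and keeps the best wall-clock time over a few runs.

```python
import subprocess
import time


def best_wall_time(argv, input_path, runs=3):
    """Run argv with input_path on stdin; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        with open(input_path, "rb") as stdin:
            start = time.perf_counter()
            # Output is discarded so terminal I/O does not dominate the timing.
            subprocess.run(argv, stdin=stdin,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL,
                           check=False)
            best = min(best, time.perf_counter() - start)
    return best


# Example use (hypothetical pattern and file name, assumes the tools are installed):
#   for argv in (["grep", "-P", "PATTERN"], ["pcregrep", "PATTERN"],
#                ["ugrep", "PATTERN"], ["ugrep", "-P", "PATTERN"]):
#       print(argv, best_wall_time(argv, "app.log"))
```

Taking the best of several runs reduces the effect of a cold file cache on the first run, which matters when the differences are fractions of a second.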
Is there something glaring I am missing?
Regards