Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about performance and syntax #141

Closed
trantor opened this issue Jun 22, 2021 · 6 comments
Closed

Question about performance and syntax #141

trantor opened this issue Jun 22, 2021 · 6 comments
Labels
question A question that has or needs further clarification

Comments

@trantor
Copy link
Contributor

trantor commented Jun 22, 2021

Hello.

I've been trying out ugrep with a simple match from STDIN on logs (around 600k lines in input).

There are a couple of things that are not clear to me.

First, in the POSIX pattern syntax there's support for captures but there's no way to use them, even as backreferences in the pattern, so why the support for capturing and non-capturing groups? Is it just a documentation problem?

Second, about performance.

I've noticed that either grep -P or pcregrep are noticeably faster than ugrep (in my case 0.75s and 0.4s for the first two vs 1.1s for ugrep, with ~14s ! using ugrep -P ).

Although I like the feature set of ugrep a lot, the difference in performance is rather worrying, especially considering that my use case would be to look for matches across several GBs of compressed log files.

Is there something glaring I am missing?

Regards

@genivia-inc
Copy link
Member

Backreferences are at this time incomplete and only possible for separate alternations e.g. (a)|(b), it's mentioned in the README. But this will be addressed in near-future updates of the RE/flex library that has this feature planned.

I don't understand how ugrep -P can be slower than grep -P or pcregrep. It wasn't the case a few weeks ago. Perhaps something has changed in PCRE2 or ugrep. I will take a look into this. The internals are pretty much the same. Also the JIT feature is used with PCRE2.

@genivia-inc
Copy link
Member

I have this comment in pcre2matcher.h from many months ago:

      /* TODO this may be a bug in PCRE2? If JIT is used to compile the
         pattern for complete and partial matching (these are not exclusive):
         pcre2_jit_compile(opc_, PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
         then pcre2_jit_match() without option PCRE2_PARTIAL_HARD always
         returns error PCRE2_ERROR_JIT_BADOPTION when we actually want the
         complete match when the end of the input is reached.  A final complete
         match is required, because we cannot tell whether the final partial
         match is also a complete match.  Therefore, pcre2_jit_match() is
         unusable and disabled below in favor of pcre2_match().
         */
#if 0
      if (jit_ && !(flg & PCRE2_ANCHORED))
        rc = pcre2_jit_match(opc_, reinterpret_cast<PCRE2_SPTR>(buf_), end_, pos_, flg, dat_, ctx_);
      else
#endif
        rc = pcre2_match(opc_, reinterpret_cast<PCRE2_SPTR>(buf_), end_, pos_, flg, dat_, ctx_);

The condition that is most optimal is never executed, because earlier PCRE2 versions produced an error.

I tested with the latest PCRE2 10.37 and it appears to work now (at least in a quick test run).

There is a big performance difference when many patterns match. There is not much difference if few patterns match. Without the optimized condition, the time to returns results with many matches is significantly longer:

time ugrep -P -c serialize data/big.cpp
20743
5.886u 0.014s 0:05.90 99.8%	0+0k 0+0io 0pf+0w

With the optimized condition, this runs super fast as expected:

time src/ugrep -P -c serialize data/big.cpp
20743
0.033u 0.012s 0:00.04 100.0%	0+0k 0+0io 0pf+0w

The big.cpp file is about 36MB.

I hope we can enable the optimized if-condition branch, since the PCRE2 bug happened more than a year ago, from the top of my head. However, I need to test more to assess if this is indeed the case.

@genivia-inc genivia-inc added the question A question that has or needs further clarification label Jun 22, 2021
@genivia-inc
Copy link
Member

Well, after checking a bit more, for some reason some patterns do not match when the optimized condition is enabled. There should not be any difference, because the two calls pcre2_jit_match and pcre2_match should produce the same matches (according to PCRE2), but they aren't:

ugrep -P '' tests/Hello.txt

returns nothing, when it should match everything (in this case just the word Hello without newline in Hello.txt).

So strange and worrisome to see this happen with PCRE2. I need to dig deeper.

@genivia-inc
Copy link
Member

I believe some progress has been made as all ugrep tests pass cleanly as of this evening, including the extensive RE/flex tests and additional edge cases. I don't remember why I could not get rid of the PCRE2_ERROR_JIT_BADOPTION problem months ago, despite the excellent PCRE2 documentation. I did notice that the PCRE2_JIT_COMPLETE flag was not used with pcre2_jit_compile, which the documentation (now?) specifically states should be used in addition to PCRE2_JIT_PARTIAL_HARD. That essentially solved the problem. Now pcre2_jit_match works as expected and has no issues running in many concurrent threads:

$ time src/ugrep -P -c serialize data/big.cpp data/big.cpp data/big.cpp data/big.cpp
data/big.cpp:20743
data/big.cpp:20743
data/big.cpp:20743
data/big.cpp:20743
0.132u 0.030s 0:00.04 400.0%	0+0k 0+0io 0pf+0w

Optimal parallel efficiency of 400% to search in 0.04s running (wall-clock) time, the same time as searching one file:

$ time src/ugrep -P -c serialize data/big.cpp
20743
0.031u 0.011s 0:00.04 100.0%	0+0k 0+0io 0pf+0w

Nice!

@genivia-inc
Copy link
Member

Updated to v3.3.4.

@genivia-inc
Copy link
Member

Let me add that capturing groups and backreferences work fine with -P and will be available with the default POSIX regex matching in ugrep in the future.

The plan is to use a much more efficient algorithm for capturing groups (and backreferences) compared to GNU grep. This is work in progress with the RE/flex library, see Genivia/RE-flex#95. On the other hand, I could "cheat" and use another regex library, but one that is less efficient than RE/flex. I am not sure I want to go there. What is the fun in that?

Important things come first! Ugrep needs to be very reliable, efficient and equipped with the essential features we wanted, which seems to be achieved (or about) IMHO. I hope people will continue to report issues and deviations from GNU grep or questions, so we can be sure. Then it's time to tackle the few remaining items that require more R&D effort such as efficient capturing groups and backreferences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question A question that has or needs further clarification
Projects
None yet
Development

No branches or pull requests

2 participants