I have tried to use Perl Compatible Regular Expressions (PCRE) with grep -P, and also with pcre2grep and pcregrep.
All of them show the same error: they treat Latin letters with diacritics as a word boundary.
Some of them also wrongly fail to match these letters as lower-case letters.
It should not be an encoding error, because I have set UTF-8 in all locale variables:
Code:
$ printenv|grep -P '^L[AC]'|sort
LANG=sk_SK.UTF-8
LC_ADDRESS=sk_SK.UTF-8
LC_IDENTIFICATION=sk_SK.UTF-8
LC_MEASUREMENT=sk_SK.UTF-8
LC_MONETARY=sk_SK.UTF-8
LC_NAME=sk_SK.UTF-8
LC_NUMERIC=sk_SK.UTF-8
LC_PAPER=sk_SK.UTF-8
LC_TELEPHONE=sk_SK.UTF-8
LC_TIME=sk_SK.UTF-8
Testing file:
Code:
$ cat diakritika.txt
-čí
-čia
-čo
-Evička
-Košice
-ký
-mám
-úži
-Žiar
-42úver
IMHO wrong results:
Code:
$ grep -P '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-ký
-mám
-Žiar
-42úver
Code:
$ pcregrep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver
Code:
$ pcre2grep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver
Versions of commands and libraries:
Code:
$ LANG=C; grep --version
grep (GNU grep) 3.11
⋮
grep -P uses PCRE2 10.44 2024-06-07
Code:
$ pcregrep --version
pcregrep version 8.39 2016-06-14
Code:
$ pcre2grep --version
pcre2grep version 10.44 2024-06-07
Code:
$ bash --version
GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)
(Of course, I switched LANG to C after the testing, because of the language used when printing the versions.)
* grep -P and pcre2grep seem to use the same library, but they give different results;
* -ký & -mám are correct answers, but pcregrep & pcre2grep do not match them;
* -úži should be in the results, but none of the commands matches it;
* -Evička, -Košice, -Žiar, -42úver are all false results, because they begin with an upper-case letter or a digit (digits are considered to be word characters by Regular-Expressions.info: Word Boundaries).
Of course, I use single letters as single Unicode characters, not composites with Combining Diacritical Marks.
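The expectations above can be cross-checked independently of PCRE. The following is only a minimal Python sketch, not a PCRE reimplementation. It assumes that \b should treat accented letters as word characters, and it emulates \p{Ll} with unicodedata.category, because Python's standard re module has no \p{...} support. The names LINES, is_lowercase_letter and has_boundary_then_two_lowercase are made up for illustration.

```python
import re
import unicodedata

# Lines of the test file diakritika.txt, as listed in the post.
LINES = ["-čí", "-čia", "-čo", "-Evička", "-Košice",
         "-ký", "-mám", "-úži", "-Žiar", "-42úver"]


def is_lowercase_letter(ch):
    # Unicode general category Ll, i.e. what \p{Ll} is meant to match.
    return unicodedata.category(ch) == "Ll"


def has_boundary_then_two_lowercase(line):
    # Normalize to NFC so every accented letter is a single code point.
    line = unicodedata.normalize("NFC", line)
    # For str patterns, Python's \b counts non-ASCII letters as word
    # characters. The lookahead captures the two characters that follow
    # each word boundary without consuming them.
    for m in re.finditer(r"\b(?=(..))", line):
        first, second = m.group(1)
        if is_lowercase_letter(first) and is_lowercase_letter(second):
            return True
    return False


selected = [line for line in LINES if has_boundary_then_two_lowercase(line)]
print("\n".join(selected))
```

Under these assumptions the sketch selects -čí, -čia, -čo, -ký, -mám and -úži, and rejects -Evička, -Košice, -Žiar and -42úver. It only shows what a Unicode-aware word boundary would give; it says nothing about why the three PCRE tools behave differently.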
Am I making a mistake? Or is it a bug in the libraries (which seems improbable to me)?
Statistics: Posted by ruwolf — 2025-01-02 18:30 — Replies 0 — Views 21