Channel: Debian User Forums

Programming • [Bash] PCRE: Latin letters with diacritics false match and not-match

I have tried to use Perl Compatible Regular Expressions (PCRE) via grep -P, and also via pcre2grep and pcregrep.
All of them show the same error: they treat Latin letters with diacritics as word boundaries.
Some of them also wrongly fail to match these letters as lower-case letters.

It should not be an encoding error, because I have set UTF-8 in all locale variables:

Code:

$ printenv|grep -P '^L[AC]'|sort
LANG=sk_SK.UTF-8
LC_ADDRESS=sk_SK.UTF-8
LC_IDENTIFICATION=sk_SK.UTF-8
LC_MEASUREMENT=sk_SK.UTF-8
LC_MONETARY=sk_SK.UTF-8
LC_NAME=sk_SK.UTF-8
LC_NUMERIC=sk_SK.UTF-8
LC_PAPER=sk_SK.UTF-8
LC_TELEPHONE=sk_SK.UTF-8
LC_TIME=sk_SK.UTF-8

Test file:

Code:

$ cat diakritika.txt
-čí
-čia
-čo
-Evička
-Košice
-ký
-mám
-úži
-Žiar
-42úver

IMHO wrong results:

Code:

$ grep -P '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-ký
-mám
-Žiar
-42úver

Code:

$ pcregrep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver

Code:

$ pcre2grep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver

Versions of commands and libraries:

Code:

$ LANG=C; grep --version
grep (GNU grep) 3.11
⋮
grep -P uses PCRE2 10.44 2024-06-07

Code:

$ pcregrep --version
pcregrep version 8.39 2016-06-14

Code:

$ pcre2grep --version
pcre2grep version 10.44 2024-06-07

Code:

$ bash --version
GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)

(Of course, I switched LANG to C only after the testing, so that the version information is printed in English.)
grep -P and pcre2grep seem to use the same library, but they give different results;
-ký & -mám are correct answers, but pcregrep and pcre2grep do not match them;
-úži should be in the results (and, by the same reasoning, so should -čí and -čo), but none of the commands matches them;
-Evička, -Košice, -Žiar, -42úver are all false matches, because they begin with an upper-case letter or a digit (digits are considered word characters according to Regular-Expressions.info: Word Boundaries).
Of course, I use precomposed letters, each a single Unicode character, not base letters followed by Combining Diacritical Marks.
Am I making some mistake? Or is it a bug in the libraries (which seems improbable to me)?
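A quick way to double-check the point about precomposed letters, as a sketch rather than a definitive test: dump the bytes and compare them with the two possible UTF-8 encodings of č. The helper name hexbytes is my own invention; the byte values follow from UTF-8 (U+010D for the precomposed letter, c followed by U+030C COMBINING CARON for the decomposed form).

```shell
# hexbytes is a hypothetical helper, not part of any tool mentioned above.
# It prints the bytes of its argument as one lower-case hex string.
hexbytes() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

# Octal escapes keep the example independent of how this text is encoded.
precomposed=$(printf '\304\215')   # U+010D, LATIN SMALL LETTER C WITH CARON
decomposed=$(printf 'c\314\214')   # 'c' + U+030C, COMBINING CARON

echo "precomposed: $(hexbytes "$precomposed")"
echo "decomposed:  $(hexbytes "$decomposed")"
```

Running od -An -tx1 on diakritika.txt in the same way should show c4 8d for every č if the file really is precomposed; the combining marks U+0300 to U+036F would show up as the byte pairs cc 80 through cd af.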

Statistics: Posted by ruwolf — 2025-01-02 18:30 — Replies 0 — Views 21


