Incorrectly parsing UTF-8 words #223

michalpodrouzek · 2024-10-22T14:35:15Z

Hello,

We've had an issue with the parser. It was working correctly for most languages, but we've noticed that it incorrectly parses words in Czech. For example, we had a term CI, and the parser was matching it inside the word zákazníci, treating the trailing "ci" as the term.

For anyone who runs into this issue, our solution was to add the `u` (UTF-8) modifier to the regex pattern in ParserService.
Here is a patch for this:

```diff
diff --git a/Classes/Service/ParserService.php b/Classes/Service/ParserService.php
--- a/Classes/Service/ParserService.php (revision 29da54f)
+++ b/Classes/Service/ParserService.php (date 1729607337701)
@@ -580,7 +580,7 @@
                 '($|[\s<[:punct:]]|<br*>' . self::$additionalRegexWrapCharacters . ')' .
                 '(?![^<]>|[^<>]</)' .
                 '#' .
-                ($term->isCaseSensitive() ? '' : 'i');
+                ($term->isCaseSensitive() ? '' : 'i') . 'u';
 
             // replace callback
             $callback = function (array $match) use (
```
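To illustrate why the `u` modifier matters, here is a minimal sketch. It is not the extension's actual pattern: `termMatches()` is a hypothetical helper, and it uses a plain `\b` word boundary instead of the boundary groups in ParserService. Without `u`, PCRE works byte by byte, so the two bytes of "í" do not count as word characters and a boundary appears right before "ci". With `u`, PHP treats the subject as UTF-8 and "í" is recognised as a letter, so the term no longer matches inside the word.

```php
<?php
// Hypothetical, simplified stand-in for the term pattern in ParserService.
// Assumption: a plain \b boundary is close enough to show the byte-vs-UTF-8 effect.
function termMatches(string $term, string $text, bool $unicode): bool
{
    // Case-insensitive match; append "u" only when UTF-8 mode is requested.
    $pattern = '#\b' . preg_quote($term, '#') . '\b#i' . ($unicode ? 'u' : '');

    return preg_match($pattern, $text) === 1;
}

// "zákazníci", written with escapes so the bytes are UTF-8
// regardless of how this file is saved.
$word = "z\u{E1}kazn\u{ED}ci";

var_dump(termMatches('CI', $word, false)); // byte mode: matches inside the word
var_dump(termMatches('CI', $word, true));  // UTF-8 mode: no match inside the word
var_dump(termMatches('CI', 'our CI server', true)); // a standalone term still matches
```

Whether this is exactly what happens inside the extension's longer pattern is an assumption, but it matches the behaviour described above.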

Thanks for this extension :)

featdd · 2024-11-17T15:56:26Z

Hi @michalpodrouzek,

I have to check whether adding this causes issues elsewhere; there were issues with the case of umlauts as well.

Greetings
Daniel
