<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>

<p>It turned out the markup filter was the culprit. It is supposed to convert the HTML tags in messages to plain text, and in doing so it also normalized whitespace, replacing every run of whitespace characters with a single regular space. After the switch to Python 3, that started to include non-breaking spaces.</p>

<p>I don't think this filter should normalize whitespace at all in parts of the text that are not affected by HTML tags, but that may need a discussion. For now, I have restored the Python 2 behavior of normalizing only ASCII whitespace: <a href="https://invent.kde.org/sdk/pology/commit/0da46cdb3802c03b0930aeb70f781256d9fe6e69">https://invent.kde.org/sdk/pology/commit/0da46cdb3802c03b0930aeb70f781256d9fe6e69</a></p>
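
<p>To illustrate the difference, here is a minimal sketch. It is not the code from the commit: the sample string and the variable names are made up, and it assumes the normalization is done with a <code>\s+</code> regular expression. On Python 3, <code>\s</code> in a str pattern matches Unicode whitespace, which includes U+00A0, while the <code>re.ASCII</code> flag restricts it to ASCII whitespace.</p>

<pre>
```python
import re

# Hypothetical sample text; "\u00a0" is a non-breaking space (U+00A0).
text = "10\u00a0km   per\thour"

# On Python 3, \s in a str pattern matches Unicode whitespace,
# and that includes the non-breaking space.
unicode_ws = re.compile(r"\s+")

# re.ASCII restricts \s to the ASCII whitespace characters [ \t\n\r\f\v].
ascii_ws = re.compile(r"\s+", re.ASCII)

# Replace every run of whitespace with a single regular space.
normalized_unicode = unicode_ws.sub(" ", text)
normalized_ascii = ascii_ws.sub(" ", text)

print(repr(normalized_unicode))  # non-breaking space has become a regular space
print(repr(normalized_ascii))    # non-breaking space is preserved
```
</pre>

<p>With the ASCII-only pattern, the run of regular spaces and the tab are still collapsed, but the non-breaking space between "10" and "km" survives.</p>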

<p id="reply-intro">On 2022-10-08 12:21, Karl Ove Hufthammer wrote:</p>

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">

<div class="pre" style="margin: 0; padding: 0; font-family: monospace"><span style="white-space: nowrap;">Adrian Chaves wrote on 08.10.2022 12:13:</span>

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">I have debugged this issue and I believe the root cause is “addFilterHook name="normalize/noinvisible" on="pmsgstr" handle="noinvisible"”, defined in puretext.filters, which is included in ortography.rules. So I think this is another case where Python 3 is working as expected, and Python 2 was not.</blockquote>

<br />Hmm. The ‘noinvisible’ hook should remove only invisible characters. They are defined in normalize.py:<br /><br /><span style="white-space: nowrap;"># As defined by <a href="http://www.unicode.org/faq/unsup_char.html" target="_blank" rel="noopener noreferrer">http://www.unicode.org/faq/unsup_char.html</a>.</span><br /><span style="white-space: nowrap;">_invisible_character_codepoints = ([]</span><br /><span style="white-space: nowrap;">    + [0x200C, 0x200D] # cursive joiners</span><br /><span style="white-space: nowrap;">    + list(range(0x202A, 0x202E + 1)) # bidirectional format controls</span><br /><span style="white-space: nowrap;">    + [0x00AD] # soft hyphen</span><br /><span style="white-space: nowrap;">    + [0x2060, 0xFEFF] # word joiners</span><br /><span style="white-space: nowrap;">    + [0x200B] # the zero width space</span><br /><span style="white-space: nowrap;">    + list(range(0x2061, 0x2064 + 1)) # invisible math operators</span><br /><span style="white-space: nowrap;">    + [0x115F, 0x1160] # Jamo filler characters</span><br /><span style="white-space: nowrap;">    + list(range(0xFE00, 0xFE0F + 1)) # variation selectors</span><br /><span style="white-space: nowrap;">)</span><br /><br />But the non-breaking space (U+00A0) is not among these characters, and shouldn’t be removed (or replaced by a normal space).<br /><br /></div>

</blockquote>



</body></html>