KrzysztofTrybowski, thanks for the fix - I changed your Trac permissions, can you update the ticket now?
Agreed, this must be fixed. It is a trivial fix, though.
I fixed this in [1725]. Massdelete did not work because of this erroneous construction:
foreach($_GET as $key)
which was replaced with the correct one:
foreach($_GET as $key => $val)
In the first case, the loop variable actually holds a VALUE, not a KEY.
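To make the difference concrete, here is a minimal sketch. The `$params` array is a made-up stand-in for `$_GET`, not the actual massdelete request data.

```php
<?php
// Hypothetical query-string data standing in for $_GET.
$params = ['page' => 'HomePage', 'action' => 'delete'];

$single = [];
foreach ($params as $key) {
    // Despite its name, $key receives each VALUE here.
    $single[] = $key;
}

$pairs = [];
foreach ($params as $key => $val) {
    // Now $key is the array key and $val is its value.
    $pairs[] = "$key=$val";
}

print_r($single); // values only: HomePage, delete
print_r($pairs);  // page=HomePage, action=delete
```

Any code that expects parameter names in the loop variable therefore only works with the second form.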
I am unable to change the status of this ticket though.
The script converts pages, notes and comments that were encoded as UTF-8 twice into correct UTF-8. So this fixes the double-unicoded issue. ATTN: if run on a correct UTF-8 database, it will destroy it!
The script doesn't convert “double-unicoded” strings that are a result of using pre-1.3 Wikka with a template modified to use a UTF-8 HTML charset declaration.
Such a fix can be done this way:
$body = mb_convert_encoding($body, 'Windows-1252', 'UTF-8');
...but it would destroy the data if run on a proper UTF-8 database. Perhaps we should provide a script for that, but it should first run some checks and issue warnings.
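One possible shape for such a check, as a rough sketch only: the function name `undo_double_utf8` is made up, it assumes the mbstring extension and a PHP version with type declarations, and it assumes the wrongly applied single-byte charset was Windows-1252. It leaves a string untouched unless undoing one encoding layer yields valid UTF-8 and re-encoding gives back exactly the input.

```php
<?php
// Hypothetical helper, not part of Wikka. Requires the mbstring extension.
function undo_double_utf8(string $text): string
{
    // Not valid UTF-8 at all: do not touch it.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Undo one layer: map each UTF-8 character back to a single byte.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
    // Pure ASCII text comes back unchanged; nothing to fix.
    if ($candidate === $text) {
        return $text;
    }
    // Correctly encoded text yields bytes that are not valid UTF-8.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $text;
    }
    // Characters outside Windows-1252 get substituted, so the round trip fails.
    if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
        return $text;
    }
    return $candidate;
}
```

This is a heuristic: correctly encoded text that legitimately contains a sequence such as "Ã³" would still be altered, so a real script should also warn and ask for confirmation before writing anything back.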
Also I dislike the fact that the script creates new revisions of pages. Charset conversion is a purely technical task, and thus shouldn't leave any trace in the data.
Do you mean “convertHTMLEntities.php”? If so, then I just tested it, and:
* the script works well when non-ASCII characters were typed in an unmodified Wikka 1.2 or older. In that case the DB contains HTML entities, not actual characters, and those entities are converted to Unicode characters.
* the script doesn't touch the second, quite frequent case, where Wikka 1.2 or older was used with a small hack: the user altered the template to set the HTML charset to UTF-8. In this case the database contains text which is double-unicoded: a character is encoded into UTF-8, hence it takes two bytes. Those two bytes are then treated as two separate characters and both are saved as UTF-8, turning into four bytes. I think we need a script that takes care of this issue, but it should be a separate script.
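The byte counts in the second case can be reproduced with a short sketch. It assumes the mbstring extension and uses Windows-1252 as the single-byte charset the bytes were mistaken for, which is an assumption about the affected setups.

```php
<?php
// Requires the mbstring extension.
$original = "\xC3\xB3"; // "ó" as correct UTF-8: two bytes

// Simulate the mistake: the two UTF-8 bytes are read as two Windows-1252
// characters, and each of them is encoded to UTF-8 again.
$doubled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

echo strlen($original), " bytes: ", bin2hex($original), "\n"; // 2 bytes: c3b3
echo strlen($doubled), " bytes: ", bin2hex($doubled), "\n";   // 4 bytes: c383c2b3
```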
@DarTar: the thing you describe in comment:12 is caused by the installer being unable to properly save *default* pages into the database. I described this error in #1008 and provided a solution (a simple, trivial one). The installer takes text from a .txt file, which is UTF-8, and inserts it into the database without first letting MySQL know that we're inserting UTF-8. Hence the database treats each byte as a separate character, and each non-ASCII character ends up taking four bytes (it's double-unicoded). Fishy's script has nothing to do with this.
Another issue with the installer: when you type wrong settings, the reloaded page (with errors marked in red) has both theme and default language reset to defaults. This can be easily fixed. In file setup/default.php replace:
<tr><td align="right" nowrap="nowrap">Theme:</td><td><?php SelectTheme(); ?></td></tr>
<tr><td align="right" nowrap="nowrap">Language pack:</td><td><?php Language_selectbox('en'); ?></td></tr>
with
<tr><td align="right" nowrap="nowrap">Theme:</td><td><?php SelectTheme($wakkaConfig["theme"]); ?></td></tr>
<tr><td align="right" nowrap="nowrap">Language pack:</td><td><?php Language_selectbox($wakkaConfig["default_lang"]); ?></td></tr>
Is there any reason not to make this change?
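For illustration, this is roughly how a select box can keep the submitted value after a failed validation. `render_selectbox` is a made-up helper, not Wikka's actual `SelectTheme()` or `Language_selectbox()` implementation.

```php
<?php
// Hypothetical helper, shown only to illustrate preselecting a submitted value.
function render_selectbox(string $name, array $options, string $selected): string
{
    $html = '<select name="' . htmlspecialchars($name) . '">';
    foreach ($options as $option) {
        // Mark the previously chosen option so it survives a page reload.
        $attr = ($option === $selected) ? ' selected="selected"' : '';
        $html .= '<option value="' . htmlspecialchars($option) . '"' . $attr . '>'
               . htmlspecialchars($option) . '</option>';
    }
    return $html . '</select>';
}

// The second option stays selected instead of falling back to the first one.
echo render_selectbox('default_lang', ['en', 'pl'], 'pl'), "\n";
```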
Replying to KrzysztofTrybowski:
What you see is raw UTF-8. This is what you see if you have utf8 in bytes but interpret them as let's say iso-8859-1 or ascii. Can you direct me to this “fishy's Unicode conversion script” you mention? Can I test it?
I've uploaded the latest conversion script to the 1.3 branch. It's part of revision [1715], so if you do an svn update you will get the script.
There is one issue in the installer when it installs a fresh Wikka: default pages (which should be encoded in UTF-8) are loaded as ASCII. As a result, after installing a fresh Polish Wikka, for example, all Polish characters are messed up.
The fix is to modify the update_default_page function:
$body = implode('', file($txt_filepath));
mysql_query('update '.$config['table_prefix'].'pages set latest = "N" where tag = \''.$tag.'\'', $dblink);
changes to:
$body = implode('', file($txt_filepath));
mysql_query("SET NAMES 'utf8'", $dblink);
mysql_query('update '.$config['table_prefix'].'pages set latest = "N" where tag = \''.$tag.'\'', $dblink);
Also there's another problem related to upgrades, but it involves upgrading from a non-standard Wikka installation (not sure if we want to bother). Anyway:
it was a common technique until version 1.3 to alter Wikka's template so that it declared the HTML charset as UTF-8. Now if we upgrade to Wikka 1.3, the non-ASCII characters display as “raw unicode”, for example: instead of ó you see Ă³, and instead of ś you see Å›. This is a result of inserting Unicode text into the DB without declaring SET NAMES 'utf8' first. In this case I believe that non-ASCII characters are “double-unicoded” in the DB.
As of [1725] the Polish UI translation is updated. Default pages are still pending. Anyway, before default pages can be used, the error in the installer has to be fixed (it treats the text files containing default pages not as UTF-8 but as a single-byte charset).
What you see is raw UTF-8. This is what you see if you have utf8 in bytes but interpret them as let's say iso-8859-1 or ascii.
Can you direct me to this “fishy's Unicode conversion script” you mention? Can I test it?