iconv and charset conversion troubles


Postby Christoph Burschka on Mon Dec 01, 2008 4:29 pm

Hi,

I am trying to aggregate content on a website into a database, and
getting severe encoding troubles with the mdash character (—,
U+2014) as well as a bullet point (•, U+2022) and probably other
special characters too.

The remote website declares its charset as ISO-8859-1, and when viewing
it as such in the browser, I can see the — and • characters
just fine. When looking at aggregated content (HTTP via fsockopen) on
my own website, which declares UTF-8, of course the characters do not
display correctly.

Leaving aside the database for later, first I wanted to convert the
string such that it would display properly on my UTF-8 website. I
assumed this would be done with

$data = iconv('ISO-8859-1', 'UTF-8', $data);

However, the converted content will not display properly either, so it's
clear I need some more advice.

To avoid ambiguity or encoding troubles, I am showing all the characters
in base 64 encoding.

The character that the remote website sends is "lw==" in base 64.

Converted with the above iconv() command, it becomes "wpc=".

When I copy-paste the rendered character into a PHP script and encode
that, it becomes "4oCU". Not sure which encoding that is.

How should I approach this problem? Thanks
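
For reference, a small sketch that decodes the three base 64 samples quoted above and prints their bytes in hex. The helper name hexBytes() is made up for illustration; it only wraps base64_decode() and bin2hex().

```php
<?php
// Hypothetical helper: show the raw bytes behind a base 64 sample as hex.
function hexBytes(string $b64): string
{
    return bin2hex(base64_decode($b64));
}

// The three samples mentioned in the post, keyed by where they came from.
$samples = [
    'as sent by the remote site'    => 'lw==',
    'after iconv() from ISO-8859-1' => 'wpc=',
    'pasted into a PHP script'      => '4oCU',
];

foreach ($samples as $label => $b64) {
    printf("%-30s %s => %s\n", $label, $b64, hexBytes($b64));
}
```

The third sample comes out as e2 80 94, which is the three-byte UTF-8 encoding of U+2014, so that is the byte sequence a correct conversion should produce.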

--
Christoph Burschka
Christoph Burschka
 
Posts: 199
Joined: Wed Nov 15, 2006 2:19 pm

Re: iconv and charset conversion troubles

Postby "Álvaro G. Vicario" on Tue Dec 02, 2008 1:35 am

Christoph Burschka wrote:
> I am trying to aggregate content on a website into a database, and
> getting severe encoding troubles with the mdash character (—
> U+2014) as well as a bullet point (•, U+2022) and probably other
> special characters too.

According to http://www.decodeunicode.org/ the codes are right.

> The remote website declares its charset as ISO-8859-1, and when viewing
> it as such in the browser, I can see the — and • characters
> just fine. When looking at aggregated content (HTTP via fsockopen) on
> my own website, which declares UTF-8, of course the characters do not
> display correctly.

I understand you have the raw chars, not the HTML entities, correct?

> Leaving aside the database for later, first I wanted to convert the
> string such that it would display properly on my UTF-8 website. I
> assumed this would be done with
>
> $data = iconv('ISO-8859-1', 'UTF-8', $data);

If the input is ISO-8859-1 you can also use utf8_encode().

> However, the converted content will not display properly either, so it's
> clear I need some more advice.
>
> To avoid ambiguity or encoding troubles, I am showing all the characters
> in base 64 encoding.
>
> The character that the remote website sends is "lw==" in base 64.

We'll check against http://en.wikipedia.org/wiki/Iso-8859-1

print_r( unpack('C*', base64_decode('lw==')) );

Array
(
[1] => 151
)

Well, this is not a dash; it's an EPA char (End of Guarded Area). Your
input is *not* ISO-8859-1. It might be cp1252:

http://en.wikipedia.org/wiki/Windows-1252
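
A minimal sketch of that check, assuming the iconv extension is available and that the underlying library accepts the charset name 'WINDOWS-1252': convert the single byte 151 from each candidate charset and compare the results in base 64.

```php
<?php
$byte = "\x97"; // decimal 151, the byte decoded from "lw=="

// Read as ISO-8859-1, 0x97 is the C1 control character U+0097.
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

// Read as Windows-1252, 0x97 is the em dash U+2014.
$fromCp1252 = iconv('WINDOWS-1252', 'UTF-8', $byte);

echo base64_encode($fromLatin1), "\n";
echo base64_encode($fromCp1252), "\n";
```

If the second line matches the "4oCU" obtained from the copy-pasted character, cp1252 is the likely source encoding.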





--
-- http://alvaro.es/ - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://bits.demogracia.com/
-- My bain-marie humor site: http://www.demogracia.com/
--
"Álvaro G. Vicario"
 
Posts: 213
Joined: Thu Aug 09, 2007 1:32 pm

Re: iconv and charset conversion troubles

Postby Christoph Burschka on Tue Dec 02, 2008 3:42 am

Thanks very much for your help!

On 02.12.2008 09:35, Álvaro G. Vicario wrote:
> Christoph Burschka wrote:
>
> According to http://www.decodeunicode.org/ the codes are right.
>
> I understand you have the raw chars, not the HTML entities, correct?

Yeah, the page is sending raw characters, but if I pasted them here
they'd be converted automatically by my email client, so that wouldn't help.

> If the input is ISO-8859-1 you can also use utf8_encode().
>
> We'll check against http://en.wikipedia.org/wiki/Iso-8859-1
>
> print_r( unpack('C*', base64_decode('lw==')) );
>
> Array
> (
> [1] => 151
> )
>
> Well, this is not a dash; it's an EPA char (End of Guarded Area). Your
> input is *not* ISO-8859-1. It might be cp1252:
>
> http://en.wikipedia.org/wiki/Windows-1252
>

That's interesting, thanks. I do know that the data is user-submitted,
and that the content management software might not ensure the content
matches the declared ISO-8859-1 charset. When looking at it in Firefox
3.1, choosing the ISO-8859-1 charset will *work* while UTF-8 *doesn't*,
but I don't know how Firefox handles quirky encoding.

However, I got this lw== using preg_match, which may not be binary-safe.
I've used mb_strpos and mb_substr this time, and also got some of the
surrounding characters to be sure all of the multi-byte string is there.

The result of this, in base 64, is IC8+wpdB, which is

Array
(
[1] => 32
[2] => 47
[3] => 62
[4] => 194
[5] => 151
[6] => 65
)


Now the first part is " />", part of a linebreak. The last bit is "A".
There are two bytes in between, 194 and 151, which by elimination must
be the encoded character.

So now I have to find out how these two bytes are encoded.
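
One way to investigate, sketched under the assumption (only a guess at this point) that the two bytes form a UTF-8 sequence: a two-byte UTF-8 sequence has the bit pattern 110xxxxx 10yyyyyy, and the code point is the x and y bits joined together. The function name decodeTwoByteUtf8() is hypothetical.

```php
<?php
// Hypothetical helper: decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy)
// into its code point, or return null if the bytes do not fit that pattern.
function decodeTwoByteUtf8(int $lead, int $cont): ?int
{
    // Valid two-byte lead bytes are 0xC2..0xDF (0xC0 and 0xC1 would be
    // overlong encodings); continuation bytes are 0x80..0xBF.
    if ($lead < 0xC2 || $lead > 0xDF || ($cont & 0xC0) !== 0x80) {
        return null;
    }
    // Five payload bits from the lead byte, six from the continuation byte.
    return (($lead & 0x1F) << 6) | ($cont & 0x3F);
}

$codePoint = decodeTwoByteUtf8(194, 151);
printf("U+%04X (decimal %d)\n", $codePoint, $codePoint);
```

The pair decodes to code point 151 again, which suggests the 194 was added by a conversion to UTF-8 rather than sent by the remote site.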

--
Arancaytar

Re: iconv and charset conversion troubles

Postby Christoph Burschka on Tue Dec 02, 2008 6:03 am

Never mind, I accidentally used the version that had passed through iconv().

The actual raw string is IC8+l0E=, or

Array
(
[1] => 32
[2] => 47
[3] => 62
[4] => 151
[5] => 65
)

That doesn't contain the 194 byte, so there is indeed only a single-byte
character, namely 151. It matches the Windows encoding you mentioned, so
it seems that the remote site declares the wrong charset.

--
Christoph Burschka

Re: iconv and charset conversion troubles

Postby Christoph Burschka on Tue Dec 02, 2008 7:07 am

Thanks very much for your help!

On 02.12.2008 09:35, Álvaro G. Vicario wrote:

> Christoph Burschka wrote:
> According to http://www.decodeunicode.org/ the codes are right.
> I understand you have the raw chars, not the HTML entities, correct?

Yeah, the page is sending raw characters, but if I pasted them here
they'd be converted automatically by my email client, so that wouldn't help.

> If the input is ISO-8859-1 you can also use utf8_encode().
> We'll check against http://en.wikipedia.org/wiki/Iso-8859-1
> print_r( unpack('C*', base64_decode('lw==')) );
> Array
> (
> [1] => 151
> )
> Well, this is not a dash; it's an EPA char (End of Guarded Area). Your
> input is *not* ISO-8859-1. It might be cp1252:
> http://en.wikipedia.org/wiki/Windows-1252

That's interesting, thanks. I do know that the data is user-submitted,
and that the content management software might not ensure the content
matches the declared ISO-8859-1 charset. When looking at it in Firefox
3.1, choosing the ISO-8859-1 charset will *work* while UTF-8 *doesn't*,
but I don't know how Firefox handles quirky encoding.

However, I got this lw== using preg_match, which may not be binary-safe.
I've used mb_strpos and mb_substr this time, and also got some of the
surrounding characters to be sure all of the multi-byte string is there.

The result of this, in base 64, is IC8+l0E=, or

Array
(
[1] => 32
[2] => 47
[3] => 62
[4] => 151
[5] => 65
)


Now the first part is " />", part of a linebreak. The last bit is "A".
So there is indeed only a single-byte character, namely 151. Which does
match the Windows-1252 encoding you mentioned, so it seems that the
remote site is lying about the charset. I just need to pass Cp1252 to
iconv() instead of ISO-8859-1, and it should work.

Unless the site mixes different charsets, which I hope it does not.
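
A minimal sketch of that fix with a guard for the mixed-charset case. It assumes that anything which is not already valid UTF-8 is Windows-1252; that heuristic and the function name toUtf8() are illustrative assumptions, not something the remote site guarantees.

```php
<?php
// Hypothetical helper: return $data as UTF-8, guessing Windows-1252
// for any input that is not already well-formed UTF-8.
function toUtf8(string $data): string
{
    // With the /u modifier, PCRE rejects subjects that are not valid
    // UTF-8, so an empty pattern serves as a cheap validity check.
    if (preg_match('//u', $data) === 1) {
        return $data;
    }

    $converted = iconv('WINDOWS-1252', 'UTF-8', $data);
    if ($converted === false) {
        // iconv() returns false when it hits a byte it cannot convert.
        throw new RuntimeException('Input is neither UTF-8 nor Windows-1252');
    }
    return $converted;
}

// The raw sample from this thread: " />", byte 151, "A".
echo base64_encode(toUtf8(base64_decode('IC8+l0E='))), "\n";
```

The guard is not foolproof: a Windows-1252 string whose bytes happen to form valid UTF-8 would be passed through unchanged, so it is worth spot-checking real pages from the site.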

--
Christoph Burschka

