Broken handling of unicode strings [rt.cpan.org #54369] #2

Migrated from [rt.cpan.org#54369](https://rt.cpan.org/Ticket/Display.html?id=54369) (status was 'new')

Requestors:
* cr2005@u-club.de

Attachments:
* [bugreport.digest](https://rt.cpan.org/Ticket/Attachment/730014/377101/bugreport.digest)
* [test-digest-sha1-unicode.pl](https://rt.cpan.org/Ticket/Attachment/731863/378088/test-digest-sha1-unicode.pl)


From cr2005@u-club.de on 2010-02-06 18:44:15:
```
This is a follow up of
"clears utf-8 flag"
https://rt.cpan.org/Public/Bug/Display.html?id=17919

* I believe the bug is in 'Digest' itself; all algorithms are affected.
* Digest->add() handles utf8 strings incorrectly (it clears the utf8 flag).
* The calculated digest of a UTF-8 string is wrong!

To understand the problem, I have to start with an example of why and
when to 'use utf8'.
I presume you have a UTF-8 enabled box. I am running Debian GNU/Linux
5.0 here:

echo $LANG
de_DE.UTF-8

echo -n blödsinn | hexdump -C
00000000  62 6c c3 b6 64 73 69 6e  6e                       |bl..dsinn|

I know, this is a pain in the arse.

Since this web page does not define a charset, it might mess up my
post.
I will attach this report as a UTF-8 encoded text file.

#!/bin/perl -w

$a = 'blödsinn';

printf "len of '$a' is %s\n", length $a;
__END__

save this perl script as utf8, e.g. with vi

:set fileencoding=utf-8
:w test.pl
:q

$ perl test.pl
output:

len of 'blödsinn' is 9
(wrong.)

There are two reasons to use utf8:
- assign utf8 strings within a perl script
- import the is_utf8() function

So let's fix the Example:

#!/bin/perl -w

use utf8;

$a = 'blödsinn';
# utf8::is_utf8($a) will now return 1

printf "len of '$a' is %s\n", length $a;
__END__

$ perl test.pl
output:

len of 'bl?dsinn' is 8

So the length (in characters) is now correct, but what about the garbled
output? By default perl does not treat STDIN/STDOUT/STDERR as UTF-8. Read
about the -C flag in man perlrun:

$ perl -CSDL test.pl

output:

len of 'blödsinn' is 8

Terrific. Now we have a working setup to look into Digest's problem.
Let's trash it:

#!/bin/perl -w

use utf8;
use Digest;
# the next line contains a random Asian character:
$s = 'greetz from asia 業 !';

printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);

# pick any digest u like
$dgst = Digest->new('SHA-1');
$dgst->add($s); # 'this is line 12'

printf "Digest of '$s' is %s\n", $dgst->hexdigest;

__END__

Carefully copy'n'paste the fancy double byte character. (I dunno what 
it means ;)

$ perl -CSDL test.pl
string is 'greetz from asia 業 !' len is: 20 flag: 1
Wide character in subroutine entry at test.pl line 12.

Bang! Bämm. Epic fail.

We can't cheat around this problem:

#!/bin/perl -w

use utf8;
use Digest;

$s = "blödsinn";

printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);

# pick any digest u like
$dgst = Digest->new('SHA-1');
$dgst -> add($s);

printf "Digest is %s\n", $dgst->hexdigest;
printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);
__END__

$ perl -CSDL test.pl

string is 'blödsinn' len is: 8 flag: 1
Digest is c77a16d028753a1ae761ad8eb33f5bc307364a24
string is 'blödsinn' len is: 8 flag:

echo -n blödsinn | openssl dgst -sha1
bd0f217087566043ca73d9e9ce81f7c9a4311872

Well. Here everything went wrong. If the utf8 flag is set on a string:

   * It looks like Digest tries to convert it to latin1, which fails for
     characters not defined in latin1. The digest above,
     'c77a16d028753a1ae761ad8eb33f5bc307364a24', is correct for a latin1
     string of 'blödsinn'.

   * Digest resets the is_utf8 flag, but the scalar still contains the
     utf8 encoded string, so later my application runs into big trouble.

   * => Digest calculates the "wrong" digest or refuses to work at all.


So there is an unnecessary character conversion happening under the
hood of Digest.




```
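
A digest is defined over octets, not characters, so one common way to sidestep the behaviour described above is to encode the string explicitly before hashing it. The following is a minimal sketch, not the fix proposed in this ticket. It assumes `Digest::SHA` (shipped with Perl since 5.10) as a stand-in for `Digest->new('SHA-1')`, and `sha1_of_utf8` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                         # string literals in this file are UTF-8
use Encode qw(encode);
use Digest::SHA qw(sha1_hex);     # assumption: stands in for Digest->new('SHA-1')

# Hypothetical helper: hash the UTF-8 octets of a character string.
# encode() works on a copy, so the caller's scalar keeps its utf8 flag.
sub sha1_of_utf8 {
    my ($str) = @_;
    return sha1_hex(encode('UTF-8', $str));
}

for my $s ('blödsinn', 'greetz from asia 業 !') {
    my $octets = encode('UTF-8', $s);
    printf "chars: %2d  octets: %2d  sha1: %s  flag kept: %s\n",
        length($s), length($octets), sha1_of_utf8($s),
        utf8::is_utf8($s) ? 'yes' : 'no';
}
```

For 'blödsinn' this should give the same value as the `openssl dgst -sha1` command in the report, since both hash the same nine UTF-8 octets, and the string with the wide character no longer triggers "Wide character in subroutine entry".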

From cr2005@u-club.de on 2010-02-09 16:21:08:
```
On Sat 06 Feb 2010, 13:44:15, chr wrote:
> 
> So there is an unnecessary character conversion happening under 
> the hood of Digest.

Well, I looked into MD5.xs and SHA1.xs.

I'm wondering about the SvPVbyte gotcha. I'm using perl v5.10.0, and
using SvPV instead fixes the issue for me:

#undef SvPVbyte
#define SvPVbyte SvPV

Now ->add() doesn't reset the utf8 flag, it accepts multi-byte characters,
and the hashes are correct.


```
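
The difference between the two macros can be approximated in pure Perl. `SvPVbyte` first converts the scalar to a byte representation (and croaks if it cannot), while plain `SvPV` hands over the internal buffer as it is currently stored. The sketch below uses two hypothetical helpers, `svpvbyte_like` and `svpv_like`, which only imitate that behaviour on copies; they are not part of Digest or of the patch. It also shows a side effect worth weighing: with `SvPV`, the octets seen by the digest depend on how perl happens to store the string internally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: roughly what SvPVbyte hands to the XS code.
# Works on a copy; dies for characters above 0xFF.
sub svpvbyte_like {
    my ($str) = @_;
    utf8::downgrade($str);
    return $str;
}

# Hypothetical helper: roughly what plain SvPV hands to the XS code,
# i.e. the internal buffer in whatever encoding it currently has.
sub svpv_like {
    my ($str) = @_;
    utf8::encode($str) if utf8::is_utf8($str);
    return $str;
}

# The same eight characters in both internal representations.
my $flagged   = "bl\x{f6}dsinn"; utf8::upgrade($flagged);
my $unflagged = "bl\x{f6}dsinn"; utf8::downgrade($unflagged);

printf "equal as strings:      %s\n", $flagged eq $unflagged ? "yes" : "no";
printf "SvPVbyte-like lengths: %d / %d\n",
    length(svpvbyte_like($flagged)), length(svpvbyte_like($unflagged));
printf "SvPV-like lengths:     %d / %d\n",
    length(svpv_like($flagged)), length(svpv_like($unflagged));

# A string with characters above 0xFF cannot be downgraded.
my $wide = "\x{5EE2}\x{8A71}";
my $ok   = eval { svpvbyte_like($wide); 1 };
printf "wide string downgrade: %s\n", $ok ? "ok" : "died";
printf "wide string via SvPV:  %d octets\n", length(svpv_like($wide));
```

The two strings compare equal, yet the SvPV-like view yields 9 octets for the flagged copy and 8 for the unflagged one, so the same text could hash to two different values. The SvPVbyte-like view yields 8 octets for both but dies on the wide string, which matches the error in the report.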

From cr2005@u-club.de on 2010-02-10 16:39:37:
```
$ perl test-digest-sha1-unicode.pl
1..12
ok 1 - 'nonsense' is utf8
ok 2 - length 8
ok 3 - hash cb1dc474e185777dad218b7d60f2781723d8190b
not ok 4 - 'nonsense' is still utf8
#   Failed test ''nonsense' is still utf8'
#   at test-digest-sha1-unicode.pl line 26.
ok 5 - 'blödsinn' is utf8
ok 6 - length 8
not ok 7 - hash bd0f217087566043ca73d9e9ce81f7c9a4311872
#   Failed test 'hash bd0f217087566043ca73d9e9ce81f7c9a4311872'
#   at test-digest-sha1-unicode.pl line 25.
#          got: 'c77a16d028753a1ae761ad8eb33f5bc307364a24'
#     expected: 'bd0f217087566043ca73d9e9ce81f7c9a4311872'
not ok 8 - 'blödsinn' is still utf8
#   Failed test ''blödsinn' is still utf8'
#   at test-digest-sha1-unicode.pl line 26.
ok 9 - '廢話' is utf8
ok 10 - length 2
Wide character in subroutine entry at test-digest-sha1-unicode.pl line 24, <DATA> line 3.
# Looks like you planned 12 tests but ran 10.
# Looks like you failed 3 tests of 10 run.
# Looks like your test exited with 255 just after 10.


patched (using SvPV):

LD_PRELOAD=~/.cpan/build/Digest-SHA1-2.12-amnPuM/blib/arch/auto/Digest/SHA1/SHA1.so perl test-digest-sha1-unicode.pl
1..12
ok 1 - 'nonsense' is utf8
ok 2 - length 8
ok 3 - hash cb1dc474e185777dad218b7d60f2781723d8190b
ok 4 - 'nonsense' is still utf8
ok 5 - 'blödsinn' is utf8
ok 6 - length 8
ok 7 - hash bd0f217087566043ca73d9e9ce81f7c9a4311872
ok 8 - 'blödsinn' is still utf8
ok 9 - '廢話' is utf8
ok 10 - length 2
ok 11 - hash aabc1a331f97ba4b4157abca134812992d22dccf
ok 12 - '廢話' is still utf8


```

