=head1 NAME

perlunitut - Perl Unicode Tutorial

=head1 DESCRIPTION

The days of just flinging strings around are over. It's well established that
modern programs need to be capable of communicating funny accented letters, and
things like euro symbols. This means that programmers need new habits. It's
easy to program Unicode-capable software, but it does require discipline to do
it right.

There's a lot to know about character sets, and text encodings. It's probably
best to spend a full day learning all this, but the basics can be learned in
minutes. 

These are not the very basics, though. It is assumed that you already
know the difference between bytes and characters, and realise (and accept!)
that there are many different character sets and encodings, and that your
program has to be explicit about them. Recommended reading is "The Absolute
Minimum Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)" by Joel Spolsky, at
L<http://joelonsoftware.com/articles/Unicode.html>.

This tutorial speaks in rather absolute terms, and provides only a limited view
of the wealth of character string related features that Perl has to offer. For
most projects, this information will probably suffice.

=head2 Definitions

It's important to set a few things straight first. This is the most important
part of this tutorial. This view may conflict with other information that you
may have found on the web, but that's mostly because many sources are wrong.

You may have to re-read this entire section a few times...

=head3 Unicode

B<Unicode> is a character set with room for lots of characters. The ordinal
value of a character is called a B<code point>.   (But in practice, the
distinction between code point and character is blurred, so the terms often
are used interchangeably.)

There are many, many code points, but computers work with bytes, and a byte has
room for only 256 values.  Unicode has many more characters than that,
so you need a method to make these accessible.

Unicode is encoded using several competing encodings, of which UTF-8 is the
most used. In a Unicode encoding, multiple subsequent bytes can be used to
store a single code point, or simply: character.

=head3 UTF-8

B<UTF-8> is a Unicode encoding. Many people think that Unicode and UTF-8 are
the same thing, but they're not. There are more Unicode encodings, but much of
the world has standardized on UTF-8. 

UTF-8 treats the first 128 code points, 0..127, the same as ASCII. They take
only one byte per character. All other characters are encoded as two to four
bytes using a complex scheme. Fortunately, Perl handles this for
us, so we don't have to worry about this.
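
As a rough illustration of the variable-length scheme, the following sketch
encodes four arbitrarily chosen sample characters and counts the resulting
bytes. It relies on C<encode> from the L<Encode> module, which is introduced
later in this tutorial.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Sample characters: "A", e-acute, the euro sign, and an emoji.
    for my $char ("A", "\x{E9}", "\x{20AC}", "\x{1F600}") {
        my $bytes = encode('UTF-8', $char);
        printf "U+%04X takes %d byte(s)\n", ord $char, length $bytes;
    }

This should report 1, 2, 3 and 4 bytes respectively.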

=head3 Text strings (character strings)

B<Text strings>, or B<character strings> are made of characters. Bytes are
irrelevant here, and so are encodings. Each character is just that: the
character.

On a text string, you would do things like:

    $text =~ s/foo/bar/;
    if ($text =~ /^\d+$/) { ... }
    $text = ucfirst $text;
    my $character_count = length $text;

The value of a character (C<ord>, C<chr>) is the corresponding Unicode code
point.
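
A small sketch of what that means in practice; the sample string is made up
for illustration and is written with escapes so that the source file itself
stays plain ASCII:

    use strict;
    use warnings;

    my $text = "caf\x{E9} \x{20AC}5";    # c, a, f, e-acute, space, euro, 5

    # length counts characters, not bytes.
    print length($text), "\n";                     # 7

    # ord gives the code point of a character...
    printf "U+%04X\n", ord substr($text, 5, 1);    # U+20AC

    # ...and chr turns a code point back into that character.
    print chr(0x20AC) eq substr($text, 5, 1) ? "same\n" : "different\n";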

=head3 Binary strings (byte strings)

B<Binary strings>, or B<byte strings> are made of bytes. Here, you don't have
characters, just bytes. All communication with the outside world (anything
outside of your current Perl process) is done in binary.

On a binary string, you would do things like:

    my (@length_content) = unpack "(V/a)*", $binary;
    $binary =~ s/\x00\x0F/\xFF\xF0/;  # for the brave :)
    print {$fh} $binary;
    my $byte_count = length $binary;
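
The C<unpack> template above reads a sequence of length-prefixed items. The
following sketch builds such a binary string with the matching C<pack> call
and reads it back; the two sample items are arbitrary.

    use strict;
    use warnings;

    # Each item is stored as a 32-bit little-endian length (V)
    # followed by that many bytes (a).
    my $binary = pack "(V/a)*", "abc", "de";

    my @items      = unpack "(V/a)*", $binary;
    my $byte_count = length $binary;    # 4 + 3 + 4 + 2 = 13

    print "$byte_count bytes, items: @items\n";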

=head3 Encoding

B<Encoding> (as a verb) is the conversion from I<text> to I<binary>. To encode,
you have to supply the target encoding, for example C<iso-8859-1> or C<UTF-8>.
Some encodings, like the C<iso-8859> ("latin") range, do not support the full
Unicode standard; characters that can't be represented are lost in the
conversion (by default, C<encode> replaces each of them with a substitution
character such as C<?>).
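
The following sketch shows what losing a character looks like, using the
C<encode> function from the L<Encode> module. The sample string is invented
for the example, and the optional third argument C<Encode::FB_CROAK> is one
way to ask for an exception instead of silent data loss.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $text = "\x{20AC}5";    # euro sign followed by "5"

    # iso-8859-1 has no euro sign, so by default it is replaced.
    my $latin1 = encode('iso-8859-1', $text);
    print "$latin1\n";         # ?5

    # With FB_CROAK, an unrepresentable character raises an exception.
    my $ok = eval { encode('iso-8859-1', $text, Encode::FB_CROAK); 1 };
    print "encode failed: $@" unless $ok;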

=head3 Decoding

B<Decoding> is the conversion from I<binary> to I<text>. To decode, you have to
know what encoding was used during the encoding phase. And most of all, it must
be something decodable. It doesn't make much sense to decode a PNG image into a
text string.
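
To see why the encoding has to be known, this sketch decodes the same bytes
twice, once with the right encoding and once with a wrong one, and then tries
to decode something that is not text at all. The byte strings are made up for
the example; C<Encode::FB_CROAK> asks C<decode> to die on malformed input.

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "caf\xC3\xA9";    # 5 bytes: "cafe" with e-acute, in UTF-8

    my $right = decode('UTF-8', $bytes);
    my $wrong = decode('iso-8859-1', $bytes);

    print length($right), "\n";   # 4 characters
    print length($wrong), "\n";   # 5 characters: the last one got split in two

    # The start of a PNG file is not valid UTF-8.
    my $png = "\x89PNG\r\n";
    my $ok  = eval { decode('UTF-8', $png, Encode::FB_CROAK); 1 };
    print "decode failed: $@" unless $ok;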

=head3 Internal format

Perl has an B<internal format>, an encoding that it uses to encode text strings
so it can store them in memory. All text strings are in this internal format.
In fact, text strings are never in any other format!

You shouldn't worry about what this format is, because conversion is
automatically done when you decode or encode.

=head2 Your new toolkit

Add to your standard heading the following line:

    use Encode qw(encode decode);

Or, if you're lazy, just:

    use Encode;

=head2 I/O flow (the actual 5 minute tutorial)

The typical input/output flow of a program is:

    1. Receive and decode
    2. Process
    3. Encode and output

If your input is binary, and is supposed to remain binary, you shouldn't decode
it to a text string, of course. But in all other cases, you should decode it.

Decoding can't happen reliably if you don't know how the data was encoded. If
you get to choose, it's a good idea to standardize on UTF-8.

    # get() as provided by, for example, LWP::Simple
    my $foo   = decode('UTF-8', get 'http://example.com/');
    my $bar   = decode('ISO-8859-1', readline STDIN);
    my $xyzzy = decode('Windows-1251', $cgi->param('foo'));

Processing works the way you are used to. The only difference is that you're
now using characters instead of bytes. That's very useful if you use things
like C<substr>, or C<length>.
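
A short sketch of the difference, using a made-up input string. Truncating
the decoded text keeps whole characters, while truncating the raw bytes can
cut a multi-byte character in half.

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $bytes = "na\xC3\xAFve";             # UTF-8 bytes, i with diaeresis
    my $text  = decode('UTF-8', $bytes);

    print length($text), " characters, ", length($bytes), " bytes\n";
    # 5 characters, 6 bytes

    my $good = substr $text,  0, 3;   # three whole characters
    my $bad  = substr $bytes, 0, 3;   # "na\xC3": half of a two-byte sequence

    print encode('UTF-8', $good) eq "na\xC3\xAF" ? "intact\n" : "broken\n";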

It's important to realize that there are no bytes in a text string. Of course,
Perl has its internal encoding to store the string in memory, but ignore that.
If you have to do anything with the number of bytes, it's probably best to move
that part to step 3, just after you've encoded the string. Then you know
exactly how many bytes it will be in the destination string.

The syntax for encoding text strings to binary strings is as simple as decoding:

    $body = encode('UTF-8', $body);

If you needed to know the length of the string in bytes, now's the perfect time
for that. Because C<$body> is now a byte string, C<length> will report the
number of bytes, instead of the number of characters. The number of
characters is no longer known, because characters only exist in text strings.

    my $byte_count = length $body;

And if the protocol you're using supports a way of letting the recipient know
which character encoding you used, please help the receiving end by using that
feature! For example, E-mail and HTTP support MIME headers, so you can use the
C<Content-Type> header. They can also have C<Content-Length> to indicate the
number of I<bytes>, which is always a good idea to supply if the number is
known.

    "Content-Type: text/plain; charset=UTF-8",
    "Content-Length: $byte_count"
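
Putting the three steps together, a minimal sketch of the whole flow might
look like this. The hard-coded C<$input> stands in for whatever your program
actually receives, and the header layout is only an illustration of an
HTTP-style response.

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # 1. Receive and decode.
    my $input = "gr\xC3\xBC\xC3\x9Fe";      # UTF-8 bytes, 5 characters
    my $body  = decode('UTF-8', $input);

    # 2. Process, working in characters.
    $body = ucfirst $body;

    # 3. Encode and output.
    $body = encode('UTF-8', $body);
    my $byte_count = length $body;          # 7 bytes

    print "Content-Type: text/plain; charset=UTF-8\r\n",
          "Content-Length: $byte_count\r\n",
          "\r\n",
          $body;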

=head1 SUMMARY

Decode everything you receive, encode everything you send out. (If it's text
data.)

=head1 Q and A (or FAQ)

After reading this document, you ought to read L<perlunifaq> too. 

=head1 ACKNOWLEDGEMENTS

Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the
Amsterdam Perl Mongers meetings got me interested and determined to find out
how to use character encodings in Perl in ways that don't break easily.

Thanks to Gerard Goossen from TTY. His presentation "UTF-8 in the wild" (Dutch
Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial.

Thanks to the people who asked about this kind of stuff in several Perl IRC
channels, and have constantly reminded me that a simpler explanation was
needed.

Thanks to the people who reviewed this document for me, before it went public.
They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan
Gray.

=head1 AUTHOR

Juerd Waalboer <#####@juerd.nl>

=head1 SEE ALSO

L<perlunifaq>, L<perlunicode>, L<perluniintro>, L<Encode>

