This is hard, and there's no good way. Perl does not directly support wide
characters. It pretends that a byte and a character are synonymous. The
following set of approaches was offered by Jeffrey Friedl, whose article in
issue #5 of The Perl Journal talks about this very matter.
Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (for example, the
two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'',
``VS'', ``XX'', etc.). Other bytes represent single characters, just like
ASCII.
So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character /GX/. Perl doesn't
know about Martian, so it'll find the two bytes ``GX'' in the
``I am CVSGXX!'' string, even though that character isn't there: it just
looks like it is because ``SG'' is next to ``XX'', but there's no real
``GX''. This is a big problem.
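The mismatch between bytes and Martian characters can be seen in a small sketch. It assumes the encoding rule described above (an uppercase pair is one letter, any other byte is one character); the variable names are only illustrative.

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# length() counts bytes, not Martian letters.
my $bytes = length $martian;

# Tokenize: an uppercase pair is one Martian letter, anything else is one byte.
my @chars = $martian =~ /\G([A-Z][A-Z]|.)/gs;

# A plain byte-level search finds "GX" straddling the letters SG and XX.
my $false_hit = index($martian, 'GX');

print "bytes: $bytes, characters: ", scalar(@chars), "\n";
print "byte-level match at offset $false_hit\n";
print "real GX present: ", ((grep { $_ eq 'GX' } @chars) ? "yes" : "no"), "\n";
```

The byte-level search reports a hit, while the tokenized list contains no 'GX' element.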
Here are a few ways, all painful, to deal with it:
    # Make sure adjacent ``martian'' bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;
    print "found GX!\n" if $martian =~ /GX/;
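Because the substitution rewrites the string in place, it may be preferable to pad a copy and search that instead. A minimal sketch, assuming the same sample string; `$padded` is a name introduced here for illustration:

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# Pad a copy so the original string keeps its bytes unchanged.
(my $padded = $martian) =~ s/([A-Z][A-Z])/ $1 /g;

print "[$padded]\n";
print "found GX!\n" if $padded =~ /GX/;
```

With each pair fenced off by spaces, the G of ``SG'' and the X of ``XX'' are no longer adjacent, so the search prints nothing for this string.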
Or like this:
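    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    # (a lone uppercase byte such as 'I' matches neither alternative
    # and is skipped)
    foreach $char (@chars) {
        if ($char eq 'GX') {    # block form, so print runs before last
            print "found GX!\n";
            last;
        }
    }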
Or like this:
    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') {    # block form, so print runs before last
            print "found GX!\n";
            last;
        }
    }
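The same scanning loop can be wrapped in a subroutine that reports where a Martian character occurs. This is only a sketch: `martian_index` is a hypothetical helper, not something the FAQ or Perl provides, and it assumes the pair-of-uppercase rule from the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based Martian-character index of
# $want in $text, or -1 if it never occurs on a character boundary.
sub martian_index {
    my ($text, $want) = @_;
    my $i = 0;
    # Each iteration consumes exactly one Martian character.
    while ($text =~ /\G([A-Z][A-Z]|.)/gs) {
        return $i if $1 eq $want;
        $i++;
    }
    return -1;
}

print martian_index("I am CVSGXX!", 'GX'), "\n";
print martian_index("I am CVSGXX!", 'SG'), "\n";
```

The first call finds no real ``GX''; the second locates ``SG'' by character position rather than byte offset.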
Or like this:
    die "sorry, Perl doesn't (yet) have Martian support )-:\n";
In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or
EUC encoding) is available from
CPAN as
There are many double- (and multi-) byte encodings commonly used these
days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all
mixed.
Source: Perl FAQ: Regexps. Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
(Corrections, notes, and links courtesy of RocketAware.com)
RocketAware.com is a service of Mib Software Copyright 2000, Forrest J. Cavalier III. All Rights Reserved. We welcome submissions and comments