Detecting and changing the encoding of text files.

Tuesday, August 24, 2010

When you have to handle multiple text files containing characters that are not native to English, you may run into problems with differing character encodings. This is particularly noticeable on websites: if the browser tries to interpret a text file with an encoding that differs from the one the file actually uses, strange symbols appear where those characters were supposed to show. The problem is not limited to websites, though. Any program made to work with languages other than English may show similar symptoms if encodings are not handled appropriately.

In the case of HTML files, many people, and several programs by default, opt to replace these characters with either HTML entities (e.g. &aacute; to place an á) or numeric ISO Latin-1 codes (e.g. &#225; to place an á). The truth is that nowadays every modern (and not so modern) browser can successfully handle encodings such as iso-8859-1 or utf-8. All we have to do is choose an encoding, use that same encoding for all files to avoid conflicts, and tell the browser which encoding we are using. Personally, I prefer utf-8, as I consider it a much more flexible and complete character set, and unless otherwise required I have standardized on utf-8 in all my projects and in my systems in general.

To detect the encoding used in a file, we can use the "file" command, which tries to autodetect it. If the file contains no special characters, "file" will tell us that the encoding is us-ascii, and our editor can use whatever character encoding it is set to use by default. Of course, I set my editors to work with utf-8 by default.

file --mime-encoding file.txt
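
As a rough illustration of what the command reports, the sketch below builds two small sample files and prints the detected encoding of each. The helper name detect_encoding and the sample file names are made up for this example; the -b option (brief output, without the file name prefix) is assumed to be available, as it is in the common Linux version of "file".

```shell
# Hypothetical helper: print only the detected encoding of one file.
detect_encoding() {
    file -b --mime-encoding "$1"
}

# Build two small sample files in a temporary directory (illustrative names).
tmpdir=$(mktemp -d)
printf 'plain text\n' > "$tmpdir/ascii.txt"
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"    # "café" written as UTF-8 bytes

# Report the detected encoding of every sample file.
for f in "$tmpdir"/*.txt; do
    printf '%s: %s\n' "$(basename "$f")" "$(detect_encoding "$f")"
done
```

Keep in mind that the result is a guess based on the bytes found in the file, so it is worth checking a converted file by eye before trusting it.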

Once we know the encoding of the file, we can convert it to a different character encoding if necessary:

iconv --from-code=iso-8859-1 --to-code=utf-8 file.txt > file.txt.utf8
mv file.txt.utf8 file.txt
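
If iconv fails halfway (for example, because the file is not really in the encoding we claimed), running mv anyway would replace the original with a truncated file. A minimal sketch of a more careful variant follows; the function name convert_in_place and the temporary file suffix are assumptions made for this example, not part of iconv.

```shell
# Hypothetical helper: convert_in_place FROM TO FILE
# Replaces FILE with its re-encoded version only if iconv succeeds.
convert_in_place() {
    from=$1
    to=$2
    target=$3
    tmp="$target.tmp.$$"    # temporary name; $$ is the PID of the current shell

    if iconv --from-code="$from" --to-code="$to" "$target" > "$tmp"; then
        mv "$tmp" "$target"
    else
        # Conversion failed: discard the partial output, keep the original.
        rm -f "$tmp"
        return 1
    fi
}

# Usage (file.txt is a placeholder name):
#   convert_in_place iso-8859-1 utf-8 file.txt
```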

Changing the character encoding of multiple files

When we need to change the character encoding of one file, more often than not we have to change the character encoding of other files as well. To convert several files at once we can use:

for old in *.txt
do
    iconv --from-code=iso-8859-1 --to-code=utf-8 "$old" > "$old.utf8"
done

Once this is done, we can copy each converted file over the file it was generated from, in effect replacing the original with the re-encoded version:

for old in *.utf8
do
    cp "$old" "$(basename "$old" .utf8)"
done

basename gives us the name of the file minus the ".utf8" part. If everything is OK, we can remove the temporary files that we created.

rm *.utf8
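
One caveat with the loops above: they assume every .txt file is in iso-8859-1. Running them over a file that is already in utf-8 re-encodes it and garbles the accented characters. A possible way around this, sketched below, is to ask "file" for the encoding first and convert only the files it reports as iso-8859-1. The function name convert_all_to_utf8 is made up for this example, and the sketch relies on the detection being right, which is not guaranteed.

```shell
# Hypothetical helper: convert_all_to_utf8 DIR
# Converts the .txt files in DIR that "file" reports as iso-8859-1.
convert_all_to_utf8() {
    dir=$1
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                # the pattern matched no files

        enc=$(file -b --mime-encoding "$f")
        case $enc in
            iso-8859-1) ;;                     # needs conversion, fall through
            *) continue ;;                     # utf-8, us-ascii or unknown: leave alone
        esac

        # Replace the original only if the conversion succeeded.
        if iconv --from-code="$enc" --to-code=utf-8 "$f" > "$f.utf8"; then
            mv "$f.utf8" "$f"
        else
            rm -f "$f.utf8"
        fi
    done
}

# Usage, for the current directory:
#   convert_all_to_utf8 .
```

Because files that are already utf-8 are skipped, the function can be run more than once over the same directory without damaging them.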

Categories: Commands, FOSS, Linux