Juan Valencia's Website

Detecting and changing the encoding of text files

When you have to handle text files that contain characters outside the English alphabet, you may run into problems caused by differing character encodings. This is most noticeable on websites: if the browser tries to interpret a file with an encoding other than the one the file actually uses, strange symbols appear where those characters were supposed to be. The problem is not limited to websites, though; any program that works with languages other than English can show the same symptoms if encodings are not handled appropriately.

In the case of HTML files, many people, and several programs by default, replace these characters with either HTML entities (e.g. &amp;aacute; to produce an á) or numeric references based on the ISO Latin-1 code (e.g. &amp;#225; to produce an á). The truth is that nowadays every modern (and not so modern) browser can handle encodings such as iso-8859-1 or utf-8. All we have to do is choose an encoding, use that same encoding for all files to avoid conflicts, and tell the browser which encoding we are using. Personally I prefer utf-8, as I consider it a much more flexible and complete character set, and unless something else is required I have standardized on utf-8 in all of my projects and on my systems in general.

To detect the encoding used by a file, we can use the "file" command, which tries to autodetect it from the file's contents. If the file contains only plain ASCII characters, "file" reports the encoding as us-ascii, and our editor can use whatever character encoding it is set to use by default. Of course, I set my editors to work with utf-8 by default.

file --mime-encoding file.txt
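When there are many files to inspect, it can help to print only the encoding name for each one. The following is a minimal sketch; encoding_of is a hypothetical helper name introduced here, and -b is the "brief" option of file, which drops the leading file name from the output.

```shell
# encoding_of is a hypothetical helper name, not part of file(1).
encoding_of() {
    # -b (brief) prints only the encoding, without the "filename:" prefix.
    file -b --mime-encoding "$1"
}

# Print "name<TAB>encoding" for every .txt file in the current directory.
for f in *.txt; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    printf '%s\t%s\n' "$f" "$(encoding_of "$f")"
done
```

Keep in mind that the result is a guess based on the bytes found in the file, so it is worth checking a converted file by eye before trusting it.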

Once we know the encoding of the file, we can convert it to a different character encoding if necessary, by using:

iconv --from-code=iso-8859-1 --to-code=utf-8 file.txt > file.txt.utf8
mv file.txt.utf8 file.txt
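If iconv finds a byte sequence it cannot convert, it exits with an error and the output file is incomplete, so moving it over the original would lose data. A possible way to guard against this, as a sketch only, is to replace the original just when iconv succeeds. The function name to_utf8 and its argument order are assumptions made for this example.

```shell
# to_utf8 is a hypothetical helper: to_utf8 FILE SOURCE_ENCODING
to_utf8() {
    src=$1
    from=$2
    if iconv --from-code="$from" --to-code=utf-8 "$src" > "$src.utf8"; then
        # Conversion succeeded: replace the original with the converted copy.
        mv "$src.utf8" "$src"
    else
        # Conversion failed: discard the partial output and keep the original.
        rm -f "$src.utf8"
        return 1
    fi
}

# Example call: to_utf8 file.txt iso-8859-1
```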

Changing the character encoding of multiple files

When we need to change the character encoding of one file, more often than not we have to change it for other files as well. To convert several files at once we can use:

for old in *.txt; do
    iconv --from-code=iso-8859-1 --to-code=utf-8 "$old" > "$old.utf8"
done

Once this is done, we can copy each converted file over the file it was generated from, in effect replacing the original with the reencoded version:

for old in *.utf8; do
    cp "$old" "$(basename "$old" .utf8)"
done

basename gives us the name of the file minus the ".utf8" part. If everything is ok, we can remove the temporary files that we created.

rm *.utf8
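The loops above assume that every file is encoded as iso-8859-1. When the files come from different sources, one option is to combine detection and conversion so that each file is converted from whatever encoding "file" reports, and files that are already us-ascii or utf-8 are left alone. This is only a sketch: convert_all_to_utf8 is a hypothetical name, and it trusts the guess made by "file", which can be wrong.

```shell
# convert_all_to_utf8 is a hypothetical helper: convert_all_to_utf8 FILE...
convert_all_to_utf8() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        enc=$(file -b --mime-encoding "$f")
        case $enc in
            utf-8|us-ascii)
                continue ;;   # plain ASCII is already valid UTF-8
            binary|unknown-8bit)
                echo "skipping $f: encoding not recognized ($enc)" >&2
                continue ;;
        esac
        if iconv --from-code="$enc" --to-code=utf-8 "$f" > "$f.utf8"; then
            mv "$f.utf8" "$f"          # replace the original only on success
        else
            rm -f "$f.utf8"            # drop the partial output
            echo "failed to convert $f from $enc" >&2
        fi
    done
}

# Example call: convert_all_to_utf8 *.txt
```

It is a good idea to try this on a copy of the files first, since the originals are overwritten.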