How to Determine and Change File Character Encoding of Text Files in Linux Systems

Posted by: Mohammed Semari | Published: January 17, 2017 | Updated: February 26, 2017

How many times have you needed to detect the encoding of a text file on a Linux system? Or tried to watch a movie, only to find its .srt subtitles displayed as unreadable characters?

Chances are you have needed to identify or change the encoding of a text file on Linux more than once.

All of this happens because the program reading the file assumes the wrong encoding. The solution is simple: once you know the file’s encoding, you can either set the program (your media player, for example) to use the correct encoding for the subtitles, or convert the file to a widely accepted encoding such as UTF-8.

This post is divided into two parts. In Part 1, I’ll show you how to detect a text file’s encoding using the file command, which is available by default on most Linux distributions. In Part 2, I’ll show you how to change a text file’s encoding with the iconv command, converting between CP1256 (Windows-1256, Arabic), UTF-8, ISO-8859-1 and ASCII.

Part 1: Detect a File’s Encoding using `file` Linux command

The file command makes a “best guess” about the encoding. Use the following command to determine which character encoding a file uses:

$ file -bi [filename]

Option	Description
-b, `--brief`	Don’t print filename (brief mode)
-i, `--mime`	Print the MIME type and encoding

Example 1: Detect the encoding of the file “storks.srt”

$ file -ib storks.srt
text/plain; charset=iso-8859-1

As you can see, file reports that “storks.srt” is encoded as iso-8859-1.

Example 2: Detect the encoding of the file “The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt”

$ file -ib The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt 
text/plain; charset=utf-8

Here, the “The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt” file is UTF-8 encoded.
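To see what these guesses look like on files whose encoding is known in advance, here is a small sketch. The three sample files and their names are made up for illustration, and it uses `--mime-encoding`, which prints only the charset part of what `-i` shows. On a typical system the three lines should report us-ascii, utf-8 and iso-8859-1.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Three hypothetical sample files with known bytes (octal escapes in printf).
printf 'plain text\n'  > "$workdir/ascii.txt"    # ASCII only
printf 'caf\303\251\n' > "$workdir/utf8.txt"     # "café" encoded as UTF-8
printf 'caf\351\n'     > "$workdir/latin1.txt"   # "café" encoded as ISO-8859-1

# --mime-encoding prints just the charset, without the MIME type.
for f in ascii.txt utf8.txt latin1.txt; do
    printf '%s: %s\n' "$f" "$(file -b --mime-encoding "$workdir/$f")"
done
```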

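Finally, the file command is a quick way to get a good guess at a text file’s encoding, but it is not infallible. It reliably recognizes ASCII, UTF-8 and UTF-16. For legacy single-byte encodings such as WINDOWS-1256, ISO-8859-6 or GEORGIAN-ACADEMY it usually reports iso-8859-1 or unknown-8bit, so you may have to confirm the real encoding from where the file came from or by trying a conversion.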
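Because the result of file is only a guess, one way to double-check a UTF-8 verdict is to ask iconv to decode the file strictly as UTF-8 and look at its exit status. This is a minimal sketch: `is_valid_utf8` is a hypothetical helper name, not a standard command, and the two sample files are created only for the demonstration.

```shell
# Return success only if the file decodes cleanly as UTF-8.
# is_valid_utf8 is a hypothetical helper name, not a standard command.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

workdir=$(mktemp -d)
printf 'caf\303\251\n' > "$workdir/good.txt"   # valid UTF-8 bytes
printf 'caf\351\n'     > "$workdir/bad.txt"    # lone 0xE9 byte, not valid UTF-8

for f in good.txt bad.txt; do
    if is_valid_utf8 "$workdir/$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
```

A file that passes this check may still be plain ASCII, since every ASCII file is also valid UTF-8.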

Part 2: Change a File’s Encoding using `iconv` Linux command

To use the iconv command, you need to know the current encoding of the text file you want to convert. Use the following syntax to convert a file’s encoding:

$ iconv -f [from_encoding] -t [to_encoding] [filename] > [output_filename]

Option	Description
-f, `--from-code`	Convert characters from encoding
-t, `--to-code`	Convert characters to encoding
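Note that the output file in this syntax should not be the input file itself: the shell empties the redirect target before iconv gets to read it. If you want to replace a file with its converted version, one possible approach is to go through a temporary file. The sketch below assumes mktemp and od are available; `convert_in_place` is a hypothetical helper name.

```shell
# convert_in_place FROM TO FILE
# Hypothetical helper: converts FILE from encoding FROM to encoding TO and
# overwrites the original only if iconv succeeds.
convert_in_place() {
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t "$2" "$3" > "$tmp"; then
        cat "$tmp" > "$3"   # overwrite the contents, keep the file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demo on a throwaway file containing "café" in ISO-8859-1.
demo=$(mktemp)
printf 'caf\351\n' > "$demo"
convert_in_place iso-8859-1 utf-8 "$demo"
od -An -tx1 "$demo"   # hex dump of the converted bytes
```

If the conversion fails, the original file is left untouched and the helper returns a non-zero status.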

Example 1: Convert a file’s encoding from iso-8859-1 to UTF-8 and save it to New_storks.srt

$ iconv -f iso-8859-1 -t utf-8 storks.srt > New_storks.srt

New_storks.srt is now UTF-8 encoded.

Example 2: Convert a file’s encoding from cp1256 to UTF-8 and save it to output.txt

$ iconv -f cp1256 -t utf-8 input.txt > output.txt

output.txt is now UTF-8 encoded.
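Why does the source encoding matter so much? The same bytes stand for different characters in different encodings. The sketch below uses a made-up one-letter sample rather than a real subtitle file and decodes the same byte once as CP1256 and once as ISO-8859-1. It assumes your iconv knows both names (check with iconv -l).

```shell
# One byte, 0xC7, which is the Arabic letter alef in CP1256.
sample=$(mktemp)
printf '\307\n' > "$sample"

# Decoded with the right source encoding: UTF-8 bytes d8 a7 (alef, U+0627).
iconv -f cp1256 -t utf-8 "$sample" | od -An -tx1

# Decoded with the wrong source encoding: UTF-8 bytes c3 87 ("Ç", U+00C7).
iconv -f iso-8859-1 -t utf-8 "$sample" | od -An -tx1
```

Both commands succeed without any error, which is why a wrong -f value produces unreadable text instead of a failure.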

Example 3: Convert a file’s encoding from ASCII to UTF-8 and save it to output.txt

$ iconv -f ascii -t utf-8 input.txt > output.txt

output.txt is now UTF-8 encoded.
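Since ASCII is a subset of UTF-8, this particular conversion does not change any bytes. A quick way to confirm that, sketched here with a throwaway file and assuming cmp is installed:

```shell
workdir=$(mktemp -d)
printf 'plain ASCII text\n' > "$workdir/input.txt"

iconv -f ascii -t utf-8 "$workdir/input.txt" > "$workdir/output.txt"

# cmp -s is silent and returns success when the two files are byte-identical.
if cmp -s "$workdir/input.txt" "$workdir/output.txt"; then
    echo "input and output are identical"
fi
```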

Example 4: Convert a file’s encoding from UTF-8 to ASCII

Hints:

1. UTF-8 can contain characters that cannot be encoded in ASCII. When iconv meets one, it stops with the error message "illegal input sequence at position X" unless you tell it to drop such characters with the -c option.
2. When you use iconv with the -c option, you may lose some characters from your text file.

$ iconv -c -f utf-8 -t ascii input.txt > output.txt

Option	Description
-c	Omit invalid characters from output
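To make the effect of -c concrete, here is a small sketch on a made-up sample line. It assumes the GNU (glibc) iconv found on most Linux distributions; other implementations may handle unconvertible characters differently.

```shell
sample=$(mktemp)
printf 'caf\303\251 au lait\n' > "$sample"   # "café au lait" in UTF-8

# Without -c: iconv fails at the first character ASCII cannot represent.
iconv -f utf-8 -t ascii "$sample" > /dev/null || echo "conversion failed"

# With -c: the unconvertible character is silently dropped from the output.
iconv -c -f utf-8 -t ascii "$sample"
```

The second command prints the line with the accented letter missing, so it is worth comparing the result against the original before discarding it.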

Finally, to list all the coded character sets that iconv knows, run it with the -l option as follows:

$ iconv -l

Option	Description
-l, `--list`	List known coded character sets

Here’s the output of the above command:

The following list contains all the coded character sets known. This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters. One coded character set can be
listed with several different names (aliases).

437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
 866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
 8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
 ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
 ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
 BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5,
 CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
 CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
 CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813,
 CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863,
 CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875,
 CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918,
 CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949,
 CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079,
........
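The full list is long, so it is usually easier to filter it than to read it. A small sketch, assuming grep is available; the search strings are only examples:

```shell
# Case-insensitive search of the known names for a given code page.
iconv -l | grep -i '1256'

# Quick yes/no check before converting; "utf-8" is just an example name.
if iconv -l | grep -qi 'utf-8'; then
    echo "utf-8 is supported"
fi
```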

Finally, I hope this article is useful for you.
