How to Determine and Change File Character Encoding of Text Files in Linux Systems
How many times have you needed to find out the encoding of a text file on a Linux system? Or how many times have you tried to watch a movie, only to find its .srt subtitles displayed as unreadable characters?
Chances are you have needed to find out or change the encoding of a text file on Linux more than once.
All of this happens because the file is being read with the wrong encoding. The solution is simple: once you know the file's encoding, the problem goes away. You can either set your media player (for example) to use the correct encoding for your subtitles, or convert the text file to a widely accepted encoding such as UTF-8.
This post is divided into two parts. In Part 1, I'll show you how to detect a text file's encoding using the file command, which is available by default in nearly all Linux distributions. In Part 2, I'll show you how to change the encoding of text files using the iconv command, converting between CP1256 (Windows-1256, Arabic), UTF-8, ISO-8859-1 and ASCII character sets.
Part 1: Detect a File’s Encoding using the file Linux command
The file command makes “best-guesses” about the encoding. Use the following command to determine what character encoding is used by a file :
$ file -bi [filename]
|-b||Don’t print filename (brief mode)|
|-i||Print filetype and encoding|
Example 1 : Detect the encoding of the file “storks.srt”
$ file -ib storks.srt
text/plain; charset=iso-8859-1
As you can see, the “storks.srt” file is encoded with iso-8859-1.
Example 2 : Detect the encoding of the file “The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt”
$ file -ib The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt
text/plain; charset=utf-8
Here, the “The.Girl.on.the.Train.2016.1080p.WEB-DL.DD5.1.H264-FGT.srt” file is UTF-8 encoded.
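Finally, the file command is a quick way to find out the likely encoding of a text file. Remember that its answer is a best guess: it is most reliable for ASCII and UTF-8, while single-byte legacy encodings such as WINDOWS-1256 or ISO-8859-6 are hard to tell apart from the bytes alone, so treat the result for those as a hint.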
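When a folder holds many subtitle files, it can help to check them all in one go. The following is a minimal sketch, not a definitive tool: report_encodings is a hypothetical helper name, and it assumes your file command supports the --mime-encoding option, which prints only the charset part of the -i output.

```shell
#!/bin/sh
# report_encodings is a hypothetical helper, not part of file(1).
# It prints "<file>: <charset>" for every path given as an argument.
report_encodings() {
    for f in "$@"; do
        # --mime-encoding prints only the charset that `file -bi` would report
        printf '%s: %s\n' "$f" "$(file -b --mime-encoding "$f")"
    done
}
```

For example, report_encodings *.srt lists the guessed encoding of every .srt file in the current directory, one per line.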
Part 2: Change a File’s Encoding using the iconv Linux command
To use the iconv Linux command, you need to know the current encoding of the text file you want to convert. Use the following syntax to convert the encoding of a file:
$ iconv -f [encoding] -t [encoding] [filename] > [output_filename]
|-f||Convert characters from encoding|
|-t||Convert characters to encoding|
Example 1: Convert a file’s encoding from iso-8859-1 to UTF-8 and save it to New_storks.srt
$ iconv -f iso-8859-1 -t utf-8 storks.srt > New_storks.srt
New_storks.srt is now UTF-8 encoded.
Example 2: Convert a file’s encoding from cp1256 to UTF-8 and save it to output.txt
$ iconv -f cp1256 -t utf-8 input.txt > output.txt
output.txt is now UTF-8 encoded.
Example 3: Convert a file’s encoding from ASCII to UTF-8 and save it to output.txt
$ iconv -f ascii -t utf-8 input.txt > output.txt
output.txt is now UTF-8 encoded.
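The examples above always write to a new file, and for good reason: redirecting iconv's output onto its own input file (iconv ... input.txt > input.txt) empties the file before iconv reads it. If you do want to convert a file in place, one possible approach is to go through a temporary file. This is only a sketch under that assumption; convert_to_utf8 is a hypothetical helper name, not an iconv feature.

```shell
#!/bin/sh
# convert_to_utf8 is a hypothetical helper: convert_to_utf8 SOURCE_ENCODING FILE
# It converts FILE to UTF-8 in place, touching the original only on success.
convert_to_utf8() {
    from=$1
    target=$2
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t utf-8 "$target" > "$tmp"; then
        # Copy the bytes back instead of moving the temp file,
        # so the original file keeps its owner and permissions
        cat "$tmp" > "$target"
        rm -f "$tmp"
    else
        # Conversion failed: leave the original file as it was
        rm -f "$tmp"
        return 1
    fi
}
```

For example, convert_to_utf8 iso-8859-1 storks.srt rewrites storks.srt as UTF-8. Keep a backup until you have checked the result, because a wrong source encoding still converts without an error and simply produces the wrong characters.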
Example 4: Convert a file’s encoding from UTF-8 to ASCII
Hints: 1. UTF-8 can contain characters that cannot be encoded in ASCII. When iconv meets one, it stops with the error message "illegal input sequence at position X" unless you tell it to drop such characters with the -c option. 2. When you use iconv with the -c option, you may lose some characters from your text file.
$ iconv -c -f utf-8 -t ascii input.txt > output.txt
|-c||Omit invalid characters from output|
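To see what the -c option actually does, here is a small sketch using a throwaway file. The sample text and variable names are made up for illustration, and the behaviour shown is that of GNU iconv; other iconv implementations may differ.

```shell
#!/bin/sh
# Throwaway sample: "café" written as explicit UTF-8 bytes,
# so the example does not depend on the terminal's encoding
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

# Without -c, iconv stops at the first character ASCII cannot represent
if ! iconv -f utf-8 -t ascii "$sample" > /dev/null 2>&1; then
    echo "plain conversion failed"
fi

# With -c the unconvertible character is dropped without a message.
# Some iconv builds still exit non-zero when something was dropped,
# so the exit status is ignored here.
stripped=$(iconv -c -f utf-8 -t ascii "$sample") || true
printf '%s\n' "$stripped"    # with GNU iconv this prints: caf

rm -f "$sample"
```

The "é" disappears from the output and nothing marks where it was, which is why it is worth keeping the original file when converting with -c.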
Finally, to list all the known coded character sets, run iconv with the -l option as follows:
$ iconv -l
|-l||List known coded character sets|
Here’s the output of the above command:
The following list contains all the coded character sets known. This does not necessarily mean that all combinations of these names can be used for the FROM and TO command line parameters. One coded character set can be listed with several different names (aliases).
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865, 866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4, 8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110, ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5, BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5, CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949, CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079, ........
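The list is long, so scanning it by eye for one name is tedious. A common shortcut is to filter it, for example with iconv -l | grep -i cp1256. Another option, sketched below, is to ask iconv to convert empty input, which should succeed only if it recognises the encoding name. encoding_supported is a hypothetical helper name used for illustration.

```shell
#!/bin/sh
# encoding_supported is a hypothetical helper: encoding_supported NAME
# Returns success if this system's iconv can convert from NAME to UTF-8.
encoding_supported() {
    # Empty input converts successfully only when iconv knows the name
    iconv -f "$1" -t utf-8 < /dev/null > /dev/null 2>&1
}

if encoding_supported cp1256; then
    echo "cp1256 is available"
else
    echo "cp1256 is not available on this system"
fi
```

This checks the pair that matters for a conversion to UTF-8, which fits the note in the output above that not every combination of names is guaranteed to work.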
Finally, I hope this article is useful for you.