How to remove duplicate lines from text files on Linux/UNIX systems

One day, I was preparing a blacklist to block porn websites on a web proxy site called “anaproxy.com”. From the server’s access log I extracted all the accessed URLs, then reduced each URL to its domain (more precisely, the domain together with its subdomain) and split those domains into alphabetical files for further processing. When I checked the first file, I found a huge number of duplicated lines (domains and subdomains), which forced me to search for an accurate way to remove the duplicates while keeping exactly one copy of each line. Why did I need an accurate method for removing the duplicated lines?

Simply because I’m filtering porn website domains: I need to block all of them and must not let a single domain through.

How can I remove those duplicated lines from my text files?

We can remove the duplicate lines using three options. I’ll list them here, show how they differ and how accurate each one is, and of course recommend the one I use.

I’ll use a file called “A-Websites”, which contains all the collected porn websites that start with the letter “a”. Before starting with the first option, let’s count the lines in this file. I ran the following command:

$ wc -l A-Websites
464 A-Websites

We have 464 porn domains that start with the letter “a”; some of them are of course duplicated. Let’s start.

Option 1: Using awk “pattern scanning and processing language” Command

Simply, run the following command:

$ awk '!seen[$0]++' A-Websites > A-Websites-AWK

I saved its output as “A-Websites-AWK”. Remember that “A-Websites” contains 464 lines; now I check the total number of lines in “A-Websites-AWK”:

$ wc -l A-Websites-AWK 
447 A-Websites-AWK

Hooray, I did it: 17 duplicated domains were removed. But wait a second. When I opened the file to check it, I still found some duplicated lines. Here’s what I found (these are not real porn domains, just example domains):

$ cat A-Websites-AWK

Amazon.com
amazon.com
anaproxy.com
mimastech.com
......
Mimastech.com
Anaproxy.com

Too bad: the comparison awk does here is case sensitive, so I still have domains that differ only in letter case. The good news is that awk works fine on unsorted lines and keeps the original order of the file.

Let’s move to the next option we have.
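To make the awk one-liner less magical, here is a minimal sketch on a tiny made-up sample (the file contents are hypothetical, not the real A-Websites list). The second command is a variation that is not part of the test above: it assumes you are happy to treat lines as equal when they differ only in case, and uses awk’s standard tolower() function as the array key.

```shell
# Build a small sample file (hypothetical data for illustration only).
sample=$(mktemp)
printf '%s\n' Amazon.com amazon.com anaproxy.com amazon.com Anaproxy.com > "$sample"

# seen[$0]++ evaluates to the count *before* incrementing: 0 (false) the first
# time a line appears, so !seen[$0]++ is true only for first occurrences.
exact=$(awk '!seen[$0]++' "$sample")

# Same idea, but the key is the lowercased line, so "Amazon.com" and
# "amazon.com" share one entry. The first spelling encountered is kept.
nocase=$(awk '!seen[tolower($0)]++' "$sample")

printf 'case-sensitive:\n%s\n\ncase-insensitive:\n%s\n' "$exact" "$nocase"
rm -f "$sample"
```

The case-sensitive run keeps four lines (only the exact repeat of amazon.com is dropped), while the tolower() variant keeps two, and neither needs the input to be sorted.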

Option 2: Using uniq “report or omit repeated lines” Command

The uniq command is our second option. We are still working on the same file, “A-Websites”, which contains 464 lines, some of them duplicated. There are some notes about using uniq:

  • uniq does not detect repeated lines unless they are adjacent.
  • We must sort the lines inside our text file first before using uniq.

So uniq only compares adjacent lines, and this is a very important point. First we will sort our “A-Websites” file using the following command:

$ sort A-Websites > A-Websites-Sorted

Now we are ready to use the uniq command on our sorted file “A-Websites-Sorted”, but first let’s check the number of lines/domains inside it using the following command:

$ wc -l A-Websites-Sorted 
464 A-Websites-Sorted

We still have the 464 lines we are working on. There are three cases of using the uniq command:

  • Case 1: Using uniq Without Any Options.

The simplest form is to pipe the output of the cat command to uniq without any options (uniq can also read the file directly, as in “uniq A-Websites-Sorted”). See the following command:

$ cat A-Websites-Sorted | uniq > A-Websites-Uniq-only

Check the number of lines in the new file “A-Websites-Uniq-only” by running:

$ wc -l A-Websites-Uniq-only 
447 A-Websites-Uniq-only

We see 447 lines in this file, which is the same result we got from option 1 (using the awk command). Now I’ll open “A-Websites-Uniq-only” with the cat command to check it (again, these are not real porn domains, just example domains):

$ cat A-Websites-Uniq-only

amazon.com
Amazon.com
.......
anaproxy.com
Anaproxy.com
.......
mimastech.com
Mimastech.com

Too bad: uniq is case sensitive, so I still have some duplicated domains. Also, uniq only works on sorted lines (that is why we ran the sort command at the start of option 2). If you use it on an unsorted file, duplicates that are not next to each other will be left in the output.
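The adjacency rule is easy to see on a three-line sample. This is only a sketch with made-up data; the variable names are mine.

```shell
# Two copies of amazon.com, separated by another line (hypothetical data).
sample=$(mktemp)
printf '%s\n' amazon.com anaproxy.com amazon.com > "$sample"

# uniq alone: the two amazon.com lines are not adjacent, so both survive.
unsorted=$(uniq "$sample")

# sort first: the copies become neighbours and uniq collapses them.
sorted=$(sort "$sample" | uniq)

printf 'uniq alone:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
rm -f "$sample"
```

Without the sort step all three lines come back unchanged; with it, only amazon.com and anaproxy.com remain.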

  • Case 2: Using uniq With -i Option.

To catch duplicated lines that differ only in uppercase and lowercase letters (that is, to remove the case sensitivity we saw in the case above), we use uniq with the “-i” option. This gives us what we want, as long as the sort step placed the case variants next to each other.

Again, we are still working on “A-Websites-Sorted”. Now run the following command:

$ cat A-Websites-Sorted |uniq -i > A-Websites-Uniq-i

Check the number of lines in the new file “A-Websites-Uniq-i” by running:

$ wc -l A-Websites-Uniq-i 
444 A-Websites-Uniq-i

We see 444 lines in this file, which is three lines fewer than the output of plain uniq. After checking the file with the cat command, I didn’t find any lines (porn domains in this case) duplicated with different letter case: “uniq -i” removed the duplicates that differed only in case and kept just one line of each duplicated group. Is this the correct output for what we want? The answer is in the conclusion.
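One caveat worth sketching: “uniq -i” can only merge case variants if the sort step put them side by side, and that depends on the locale sort runs under. The example below is an illustration on made-up data. Forcing the C locale with LC_ALL=C is my assumption to show the worst case (all uppercase letters sort before lowercase ones), and “sort -f” is shown as one way to make the pairing independent of locale.

```shell
sample=$(mktemp)
printf '%s\n' anaproxy.com Amazon.com mimastech.com amazon.com Anaproxy.com > "$sample"

# Byte-order sort: "Amazon.com" and "Anaproxy.com" come first, their lowercase
# twins later, so no case variants are adjacent and uniq -i merges nothing.
plain_c=$(LC_ALL=C sort "$sample" | uniq -i)

# sort -f folds case while comparing, so variants that differ only in case
# end up adjacent and uniq -i can collapse them.
folded=$(sort -f "$sample" | uniq -i)

printf 'LC_ALL=C sort | uniq -i:\n%s\n\nsort -f | uniq -i:\n%s\n' "$plain_c" "$folded"
rm -f "$sample"
```

The first pipeline still returns all five lines, while the second returns three. Which spelling of each domain survives in the second case can vary between systems, so it is better not to rely on it.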

  • Case 3: Using uniq With -u Option.

You have surely seen, before reading this post, online articles that give the “uniq -u” command as the best solution for removing duplicated lines. They may be right or wrong; here we show the difference between “uniq -u” and “uniq -i”. Let’s use “uniq -u” against our file and check its output. Run the following command:

$ cat A-Websites-Sorted |uniq -u > A-Websites-Uniq-u

Check the number of lines in the new file “A-Websites-Uniq-u” by running:

$ wc -l A-Websites-Uniq-u
430 A-Websites-Uniq-u

Amazing, we see 430 lines in this file, which is 17 lines fewer than the output of plain uniq. After checking the file with the cat command, I didn’t find any duplicated lines, and I also didn’t find the duplicated domains from our first case (plain uniq), i.e. I found neither anaproxy.com nor Anaproxy.com. “uniq -u” removes every duplicated line and does not keep a single line from each duplicated group; it outputs only the lines that are not repeated. This is not what I, and in most cases you, want: we want to remove the duplicated porn domains but keep one copy of each for our firewall to block.

“uniq -u” will give you the wrong output here (it is the correct output only in very rare cases, depending on your needs), because it drops every line that has a duplicate and outputs no line at all from that group. Note that on its own “uniq -u” still compares lines case-sensitively; lines that differ only in case are treated as one group only when “-i” is added as well.
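The difference between keeping one copy and keeping none is easiest to see side by side. The sketch below uses a made-up, already sorted sample; combining “-i” with “-u” is my addition to show the case-insensitive behaviour, not something the test above ran.

```shell
# Already sorted so that equal lines are adjacent (hypothetical data).
sample=$(mktemp)
printf '%s\n' amazon.com amazon.com Anaproxy.com anaproxy.com mimastech.com > "$sample"

# Plain uniq: one copy of every line, duplicates collapsed.
keep_one=$(uniq "$sample")

# uniq -u: only lines that were never repeated; amazon.com vanishes entirely.
# The comparison is case sensitive, so both anaproxy spellings count as unique.
only_unique=$(uniq -u "$sample")

# uniq -i -u: case is ignored, so the anaproxy pair is a repeated group too.
only_unique_nocase=$(uniq -i -u "$sample")

printf 'uniq:\n%s\n\nuniq -u:\n%s\n\nuniq -i -u:\n%s\n' \
    "$keep_one" "$only_unique" "$only_unique_nocase"
rm -f "$sample"
```

For a blocklist, the plain uniq result is the useful one: with “-u”, amazon.com would silently disappear from the list instead of being blocked once.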

Option 3: Using sort “sort lines of text files” Command

Last but not least, we can use the sort command with the “-fu” options: “-f” ignores case when comparing and “-u” outputs only one line of each set of equal lines. It is a single command that gives us exactly what we need. Remember that in order to use uniq you must run sort first, as uniq only works on sorted lines inside a file. So why use two commands to get what I need when a single “sort” with some options will do? This is the command we recommend for all of you.

Let’s use “sort -fu” against our original, unsorted file “A-Websites” and check its output. Run the following command:

$ sort -fu A-Websites > A-Websites-Sort-fu

Check the number of lines in the new file “A-Websites-Sort-fu” by running:

$ wc -l A-Websites-Sort-fu 
444 A-Websites-Sort-fu

We see 444 lines in this file. After checking the file with the cat command, I didn’t find any lines (porn domains in this case) duplicated with different letter case: “sort -fu” removed the duplicates that differed only in case and kept just one line of each duplicated group. It gives us the same result as “uniq -i”, but in one step. This is the command we use and recommend to you. A single command also avoids the intermediate sorted file and an extra process, which helps especially with large files.
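As a quick sanity check of the one-step claim, here is a sketch that runs both approaches on the same made-up, unsorted sample. The two-step pipeline uses “sort -f” so that it does not depend on the locale; that detail is my assumption, not what the test above ran.

```shell
sample=$(mktemp)
printf '%s\n' anaproxy.com Amazon.com mimastech.com amazon.com \
    Anaproxy.com Mimastech.com > "$sample"

# One step: fold case while comparing (-f), keep one line per equal set (-u).
one_step=$(sort -fu "$sample")

# Two steps: case-folded sort, then case-insensitive uniq.
two_step=$(sort -f "$sample" | uniq -i)

printf 'sort -fu:\n%s\n\nsort -f | uniq -i:\n%s\n' "$one_step" "$two_step"
rm -f "$sample"
```

Both leave one line per domain. The surviving spelling (for example Amazon.com or amazon.com) may differ between the two, so if a canonical form matters, lowercase the file first with something like tr '[:upper:]' '[:lower:]'.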

Conclusion

“sort -fu” is, in almost all cases, the command that saves your time and resources: one command that gives you a clean output. In the rare case that you need to remove every duplicated group entirely and not keep even a single line from it, “uniq -u” on a sorted text file is your command.
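If you prefer a check that does not rely on reading the file by eye, something like the following could be used. It is only a sketch: leftover_dups is a hypothetical helper name of mine, and the sample data is made up.

```shell
# leftover_dups (hypothetical helper): lowercase every line, sort, and print
# each line that occurs more than once. Empty output means no duplicates left.
leftover_dups() {
    tr '[:upper:]' '[:lower:]' < "$1" | sort | uniq -d
}

before=$(mktemp)
after=$(mktemp)
printf '%s\n' Amazon.com amazon.com anaproxy.com > "$before"
sort -fu "$before" > "$after"

dups_before=$(leftover_dups "$before")   # the raw file still has a case variant pair
dups_after=$(leftover_dups "$after")     # the cleaned file should have none

if [ -z "$dups_after" ]; then
    echo "no case-insensitive duplicates left"
fi
rm -f "$before" "$after"
```

Run against the raw sample it reports amazon.com; run against the “sort -fu” output it prints nothing.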

I hope this article is good enough for you.
See you in other articles.
