If you’ve no interest in Linux command-line things, then this isn’t likely to be that interesting for you. Sorry about that.
Due to my work I’m often in the position where I need to find the differences between two lists. Now, some people may point to ‘diff’ or ‘sdiff’ as tools appropriate to the job – and in many cases they are exactly what is required.
However, what I often need is to compare two lists, breaking them into different categories – new lists – that I can then pass through other processing to find patterns, and perhaps find out why they’re different. The categories I usually need are:
- Items that exist in both lists
- Items that exist in only the first list
- Items that exist in only the second list
One caveat though: this technique relies on each list having no duplicates. That is, you cannot have two ‘jsmiths’ within the same list, because a duplicate inside one list would look exactly like an entry shared by both lists.
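If a list might contain duplicates, one way to clean it up first is ‘sort -u’, which sorts the input and keeps a single copy of each distinct line. The sketch below is only an illustration: the file name ‘raw.txt’ and its contents are made up for the example.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical raw list in which "jsmith" appears twice.
printf '%s\n' jsmith adoe jsmith > raw.txt

# sort -u sorts the lines and keeps one copy of each distinct line.
sort -u raw.txt > a.txt

cat a.txt
# adoe
# jsmith
```

The resulting ‘a.txt’ has no repeated entries, so it meets the caveat above.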
The ‘magic’ in this technique is to use a couple of options to ‘uniq’. The first option is ‘-d’, which outputs only entries that are duplicated (one copy of each). The second option is ‘-u’, which outputs only entries that have no duplicates. In both cases ‘uniq’ only compares adjacent lines, so the input needs to be sorted first.
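To see the two options side by side, here is a small sketch on an input that is already sorted. The file name ‘sorted.txt’ and the names in it are invented for the example.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Sample input, already sorted: "bob" appears twice, the others once.
printf '%s\n' alice bob bob carol > sorted.txt

# -d: print one copy of each line that is repeated.
dups=$(uniq -d sorted.txt)
echo "$dups"
# bob

# -u: print only the lines that are not repeated.
uniques=$(uniq -u sorted.txt)
echo "$uniques"
# alice
# carol
```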
To begin, we need to find out what items in both lists are common – that is: exist in both lists. If we have list ‘a.txt’ and ‘b.txt’ then we can do this:
> cat a.txt b.txt | sort | uniq -d > common.txt
This command:
- Concatenates our two files into one list (cat)
- Sorts the new list so that any entries that exist in both files will be next to each other (sort)
- Prints one copy of each entry that appears twice in a row, that is, each entry found in both files (uniq -d)
We now have a file ‘common.txt’ that contains a list of only those entries that are common to both lists.
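As a worked example of this step, the sketch below builds two small lists and runs the same pipeline on them. The user names are made up; only the pipeline itself comes from the text above.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Two hypothetical lists, each without internal duplicates.
printf '%s\n' jsmith adoe bwayne > a.txt
printf '%s\n' bwayne ckent jsmith > b.txt

# Entries present in both files end up adjacent after sorting,
# and uniq -d prints one copy of each such pair.
cat a.txt b.txt | sort | uniq -d > common.txt

cat common.txt
# bwayne
# jsmith
```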
From here we can now find those that are only in list ‘a.txt’ or those only in list ‘b.txt’.
Only those in a.txt:
> cat a.txt common.txt | sort | uniq -u > onlyina.txt
Or in b.txt:
> cat b.txt common.txt | sort | uniq -u > onlyinb.txt
Put simply, these:
- Concatenate the common list, and the list of interest.
- Sort the entries, so duplicates (i.e. those in both the list of interest and the common list) are next to each other
- Find and print out only those values that do not have duplicates (uniq -u). This means we get only those that are not common to both original lists.
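Putting the steps together on a small data set may make this clearer. The sketch below uses made-up user names and rebuilds ‘common.txt’ first so that it runs on its own.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Two hypothetical lists, each without internal duplicates.
printf '%s\n' jsmith adoe bwayne > a.txt
printf '%s\n' bwayne ckent jsmith > b.txt

# Step 1: entries common to both lists.
cat a.txt b.txt | sort | uniq -d > common.txt

# Step 2: every common entry now appears twice (once from the list,
# once from common.txt), so uniq -u keeps only the list-specific ones.
cat a.txt common.txt | sort | uniq -u > onlyina.txt
cat b.txt common.txt | sort | uniq -u > onlyinb.txt

cat onlyina.txt
# adoe
cat onlyinb.txt
# ckent
```

This works because ‘common.txt’ is always a subset of each original list, so the only entries left unpaired are the ones unique to that list.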
So from this we now have our three categories:
- Those entries that appear in both lists (common.txt)
- Those entries that only appear in a.txt (onlyina.txt)
- Those entries that only appear in b.txt (onlyinb.txt)
Any one of these can now go through further processing to try to determine why a particular item is in a particular category.