Internal Implementation of Linux Sort Command
Prerequisite: SORT command in Linux/Unix with examples
The sort command in the Linux operating system sorts text line by line. The input can be one or more files, or text that the user supplies on the command line. By default, the command compares lines on the assumption that their data is ASCII.
The default output order follows a few rules. Lines that begin with a digit appear before lines that begin with a letter. In many locales, a line that begins with a lowercase letter appears before a line that begins with the same letter in uppercase; under plain byte ordering, as in the C locale, uppercase letters come first. The sort command has several options for sorting in various ways. Some of them include:
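- -k: This option sorts the file based on the key number, which is stated in the option after k. For example, the “-k2” option sorts the file starting from the second column. In GNU sort, “-k2” uses the text from the second field to the end of the line, while “-k2,2” restricts the key to the second field alone.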
- -n: This option compares lines by numeric value instead of the default ASCII-based comparison.
- -b: This option ignores the leading blanks in the file while sorting.
- -d: This option sorts in dictionary order, considering only blanks and alphanumeric characters.
- -r: This option reverses the result of each comparison, so the output is in descending order.
- -m: This option merges the already sorted files given as input.
- -u: This option sorts the file and outputs only the first of each group of lines that compare equal, which removes duplicates.
The sort command can be used to sort large files that cannot fit in the main memory and therefore reside in external memory. It does this through external sorting, a class of sorting algorithms that can handle enormous volumes of data. More specifically, the algorithm is External Merge Sorting: it first sorts chunks of the data that each fit in RAM and then merges the sorted chunks together.
Simply put, first divide the file into runs that fit in the main memory, then sort each run in the main memory, and finally merge the sorted runs into a single output in external memory.
Steps for Internal Implementation of Linux Sort Command
The implementation of the sort command differs slightly depending on the option given, but the primary steps remain the same. It uses the external merge sorting algorithm described above. The steps are as follows:
- First, we calculate the number of runs for sorting. This is the number of lines in the file divided by the number of lines that fit in the main memory, rounded up so that a partial final run is counted.
- Based on the number of runs and the size of the main memory, we then perform external sorting in two broad steps:
  - Creating temporary files and sorting them
  - Merging sorted files
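The run-count calculation in the first step can be sketched as a rounded-up division. This is a minimal sketch, not the implementation's actual code: the function name count_runs and the parameter lines_per_run (the number of lines assumed to fit in main memory) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of runs needed when each run holds at most lines_per_run lines.
 * The division is rounded up so a partial final run still counts.
 * Both names are assumptions made for this sketch. */
size_t count_runs(size_t total_lines, size_t lines_per_run)
{
    assert(lines_per_run > 0);
    return total_lines / lines_per_run + (total_lines % lines_per_run != 0);
}
```

For example, a file of 10 lines with room for 4 lines in memory needs 3 runs: two full runs and one run holding the remaining 2 lines.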
The second step, external sorting, is the most important one. In detail:
- To create temporary files from the main file, we need the number of runs, the size of the main memory, and the number of lines in the file. Initially, we malloc an array of FILE pointers and an array of character buffers for the filenames, with one entry per run. For each entry, a filename buffer is allocated and the file is opened in “w+” mode. In the same loop, data is copied from the main file into the temporary file character by character. The last run is treated differently because it may contain fewer characters than the other files: whatever data remains is written to the last temporary file.
- After the input temporary files are created, output temporary files are created in a similar way to store the sorted data. In each iteration, an output temporary file is created for the corresponding input temporary file, and the data is sorted in three operations:
  - Readline: Reads the input temporary file line by line and stores each line in a malloced line structure.
  - Column Separate: This supports the “-k” option, which sorts the file by key number, i.e., a specific column number. The key number defaults to 1, so sorting is based on the first column of the file unless another key is given. In this function, the data is separated column-wise and stored in another 2D array according to the key number.
  - MergeSort: This performs the actual sorting of the arrays, which leaves each temporary file sorted.
- After the temporary files are sorted individually, they are merged in sorted order and stored in a single file. This is done with a heap-based merge. The heap methods, such as heapInit, heapInsert, and heapRemove, are implemented to suit the structures used.
The primary steps for implementing the sort command remain almost the same. There are some specific changes depending on the options given to the sort command. Some of them include the following:
- For the “-k” option, a key number is provided, and sorting is done on that column.
- For the “-r” option, sorting is done in reverse order. This can be achieved either by sorting in reverse directly or by sorting normally and reversing the output at the end.
- For the “-m” option, the files are already sorted, so we only need to call the method that merges sorted files.
- For the “-n” option, sorting is performed in numeric order rather than ASCII order.
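One way to picture how “-n” and “-r” change the sort is as a single comparison function that every sorting and merging step calls. The sketch below is an assumption about structure, not the implementation's actual code: SortOptions and compare_lines are hypothetical names, and strtod is a simplification because it accepts inputs, such as exponents, that the real “-n” option does not treat as numbers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option flags for this sketch. */
typedef struct {
    int numeric; /* nonzero for "-n" */
    int reverse; /* nonzero for "-r" */
} SortOptions;

/* Returns <0, 0, or >0 like strcmp, adjusted for the selected options. */
int compare_lines(const char *a, const char *b, const SortOptions *opt)
{
    assert(a && b && opt);
    int cmp;
    if (opt->numeric) {
        /* Compare leading numeric values; non-numeric text counts as 0. */
        double x = strtod(a, NULL), y = strtod(b, NULL);
        cmp = (x > y) - (x < y);
    } else {
        cmp = strcmp(a, b);            /* byte-wise (ASCII) order */
        cmp = (cmp > 0) - (cmp < 0);   /* normalize to -1, 0, or 1 */
    }
    return opt->reverse ? -cmp : cmp;  /* "-r" flips every comparison */
}
```

With this shape, the line “10” sorts before “9” by default because the character 1 precedes 9, and after it when the numeric flag is set.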
Code Snippets for Implementation in the C Language
Creating temporary files and filenames for them:
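A minimal sketch of this step follows, assuming one file per run. The function names, the NAME_LEN bound, and the “temp_in_<index>.txt” naming scheme are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NAME_LEN 40 /* assumed to be enough for "temp_in_<index>.txt" */

/* Closes and deletes every temporary file, then frees both arrays. */
void release_temp_files(size_t runs, FILE **files, char **names)
{
    for (size_t i = 0; files && names && i < runs; i++) {
        if (files[i]) {           /* only delete files this code opened */
            fclose(files[i]);
            remove(names[i]);
        }
        free(names[i]);
    }
    free(files);
    free(names);
}

/* Allocates one filename per run and opens each file in "w+" mode.
 * Returns 0 on success and -1 on failure, in which case nothing stays open. */
int create_temp_files(size_t runs, FILE ***files_out, char ***names_out)
{
    assert(runs > 0 && files_out && names_out);
    FILE **files = calloc(runs, sizeof *files); /* zeroed so cleanup is safe */
    char **names = calloc(runs, sizeof *names);
    if (!files || !names) goto fail;
    for (size_t i = 0; i < runs; i++) {
        names[i] = malloc(NAME_LEN);
        if (!names[i]) goto fail;
        snprintf(names[i], NAME_LEN, "temp_in_%zu.txt", i);
        files[i] = fopen(names[i], "w+");
        if (!files[i]) goto fail;
    }
    *files_out = files;
    *names_out = names;
    return 0;
fail:
    release_temp_files(runs, files, names);
    return -1;
}
```

Keeping the filenames alongside the FILE pointers makes it possible to delete the temporary files once the merge has finished.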
Writing data to the temporary file from the main file character by character:
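The sketch below copies data one character at a time, as described earlier. It ends each run at a newline once a line budget is reached so that no line is split across two files; that boundary rule, the function name copy_run, and the parameter lines_per_run are assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Copies up to lines_per_run lines from src to dst, one character at a time.
 * Returns the number of characters copied; 0 means src is exhausted. */
size_t copy_run(FILE *src, FILE *dst, size_t lines_per_run)
{
    assert(src && dst);
    size_t copied = 0, lines = 0;
    int c;
    while (lines < lines_per_run && (c = fgetc(src)) != EOF) {
        fputc(c, dst);
        copied++;
        if (c == '\n')
            lines++;              /* a complete line has been written */
    }
    return copied;
}
```

Calling copy_run once per temporary file, and giving the last call a very large line budget, sends whatever data remains to the last file.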
Creating output temporary files and sorting the individual temporary files:
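A minimal sketch of sorting one temporary file into its output file is shown here. It reads lines, sorts them with a top-down merge sort in byte order, and writes them out. The names sort_run and merge_sort_lines and the MAX_LINE limit are hypothetical, plain char pointers stand in for the line structure, and key handling is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 1024 /* assumed maximum line length; longer lines are split */

/* Merges the sorted halves a[lo..mid) and a[mid..hi) using tmp as scratch. */
static void merge(char **a, char **tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (strcmp(a[i], a[j]) <= 0) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

static void sort_range(char **a, char **tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;      /* zero or one line is already sorted */
    size_t mid = lo + (hi - lo) / 2;
    sort_range(a, tmp, lo, mid);
    sort_range(a, tmp, mid, hi);
    merge(a, tmp, lo, mid, hi);
}

/* Sorts n lines in byte order; returns -1 if scratch space cannot be allocated. */
int merge_sort_lines(char **lines, size_t n)
{
    assert(lines || n == 0);
    if (n < 2) return 0;
    char **tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;
    sort_range(lines, tmp, 0, n);
    free(tmp);
    return 0;
}

/* Reads every line of in, sorts the lines, and writes them to out. */
int sort_run(FILE *in, FILE *out)
{
    assert(in && out);
    size_t n = 0, cap = 16;
    char buf[MAX_LINE];
    int rc = -1;
    char **lines = malloc(cap * sizeof *lines);
    if (!lines) return -1;
    while (fgets(buf, sizeof buf, in)) {
        buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline */
        if (n == cap) {                   /* grow the pointer array */
            char **grown = realloc(lines, 2 * cap * sizeof *lines);
            if (!grown) goto done;
            lines = grown;
            cap *= 2;
        }
        lines[n] = malloc(strlen(buf) + 1);
        if (!lines[n]) goto done;
        strcpy(lines[n++], buf);
    }
    if (merge_sort_lines(lines, n) == 0) {
        for (size_t i = 0; i < n; i++)
            fprintf(out, "%s\n", lines[i]);
        rc = 0;
    }
done:
    for (size_t i = 0; i < n; i++) free(lines[i]);
    free(lines);
    return rc;
}
```

Stripping the newline on input and adding it back on output keeps a final line that lacks a newline from being glued to its neighbor after sorting.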
Column Separate function:
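As an illustration of column separation, the sketch below locates the column selected by the key number without copying it. The name field_at is hypothetical, columns are assumed to be separated by spaces or tabs, and the result is a pointer plus a length rather than the 2D array described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Finds the key-th (1-based) blank-separated column of line.
 * Returns a pointer to its first character and stores its length in *len.
 * A missing column yields an empty key (length 0) at the end of the line. */
const char *field_at(const char *line, size_t key, size_t *len)
{
    assert(line && len && key >= 1);
    const char *p = line;
    for (size_t col = 1; ; col++) {
        while (*p == ' ' || *p == '\t')
            p++;                                  /* skip separators */
        const char *start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;                                  /* scan one column */
        if (col == key || *p == '\0' || *p == '\n') {
            *len = (col == key) ? (size_t)(p - start) : 0;
            return (col == key) ? start : p;
        }
    }
}
```

For the line “bob 25 paris” and key number 2, the function points at “25” and reports a length of 2, so the comparison step can work on that slice alone.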
Merging the sorted files using Heap functions and storing them in one output file:
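The merge can be sketched as a k-way merge over a min-heap that holds one pending line per run. The method names heapInit, heapInsert, and heapRemove come from the description above, but their signatures here, the Heap layout, the helper names, and the MAX_LINE and MAX_RUNS limits are assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 1024 /* assumed maximum line length */
#define MAX_RUNS 64   /* assumed maximum number of sorted files */

typedef struct {
    size_t run[MAX_RUNS];            /* heap of run indices */
    char   line[MAX_RUNS][MAX_LINE]; /* current line of each run */
    size_t size;
} Heap;

static void heapInit(Heap *h) { h->size = 0; }

/* True if the line in heap slot a sorts before the line in slot b. */
static int lessThan(const Heap *h, size_t a, size_t b)
{
    return strcmp(h->line[h->run[a]], h->line[h->run[b]]) < 0;
}

static void swapSlots(Heap *h, size_t a, size_t b)
{
    size_t t = h->run[a]; h->run[a] = h->run[b]; h->run[b] = t;
}

/* Adds run r, whose current line is already loaded, and sifts it up. */
static void heapInsert(Heap *h, size_t r)
{
    size_t i = h->size++;
    h->run[i] = r;
    while (i > 0 && lessThan(h, i, (i - 1) / 2)) {
        swapSlots(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Removes and returns the run whose current line is smallest. */
static size_t heapRemove(Heap *h)
{
    size_t top = h->run[0], i = 0;
    h->run[0] = h->run[--h->size];
    for (;;) {                        /* sift the moved entry down */
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->size && lessThan(h, l, m)) m = l;
        if (r < h->size && lessThan(h, r, m)) m = r;
        if (m == i) break;
        swapSlots(h, i, m);
        i = m;
    }
    return top;
}

/* Loads the next line of run r without its newline; returns 0 at end of file. */
static int loadLine(Heap *h, FILE *f, size_t r)
{
    if (!fgets(h->line[r], MAX_LINE, f)) return 0;
    h->line[r][strcspn(h->line[r], "\n")] = '\0';
    return 1;
}

/* Merges nruns sorted files into out; returns 0 on success, -1 on failure. */
int merge_runs(FILE **runs, size_t nruns, FILE *out)
{
    assert(runs && out);
    if (nruns > MAX_RUNS) return -1;
    Heap *h = malloc(sizeof *h);      /* too large for the stack */
    if (!h) return -1;
    heapInit(h);
    for (size_t r = 0; r < nruns; r++)
        if (loadLine(h, runs[r], r)) heapInsert(h, r);
    while (h->size > 0) {
        size_t r = heapRemove(h);     /* smallest pending line */
        fprintf(out, "%s\n", h->line[r]);
        if (loadLine(h, runs[r], r)) heapInsert(h, r);
    }
    free(h);
    return 0;
}
```

Because each run contributes at most one line to the heap at a time, memory use depends on the number of runs rather than on the size of the files being merged.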
Combining these atomic functions, creating the required heap and line structures, writing appropriate functions implementing heap methods, and writing the main method with all the options and calling functions completes the implementation of the Linux sort command.
Above are the snapshots of running the executable file of the implementation for two different options and files.