Detecting Delimiter in Text using detect_delimiter in Python
When working with a large corpus of text, we often need to find out which character is acting as the delimiter. Detecting it automatically is a handy utility when the volume of data makes manual inspection impractical. This article shows one way to solve the problem using the Python library detect_delimiter.
To install this module, run the command below in the terminal.
pip install detect_delimiter
Detection proceeds in three steps. First, the input text is checked for every character in the whitelist; if any are present, the most frequent one is returned, skipping any character that is also in the blacklist (if one is provided). If no whitelist character is found, the most frequent character that is not blacklisted is returned as the delimiter. If that also fails, default is returned if it was provided; otherwise None is returned.
Syntax: detect(text:str, default=None, whitelist=[',', ';', ':', '|', '\t'], blacklist=None)
text : The input string to test for a delimiter.
default : The value to return when no valid delimiter is found.
whitelist : The first set of characters to be checked. If any of them is found, it is treated as the delimiter. Useful when the set of possible delimiters is known in advance. Defaults to [',', ';', ':', '|', '\t'].
blacklist : By default, all digits, letters and the full stop are blacklisted. Any further characters that should never be reported as delimiters can be listed here; they are skipped during detection.
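The selection steps and parameters described above can be sketched in plain Python. This is only an illustration of the described behaviour, not the library's actual source: the name sketch_detect, the two DEFAULT_ constants, and the choice to add a user-supplied blacklist on top of the default one are all assumptions made for this example.

```python
from collections import Counter
import string

# Assumed defaults, taken from the parameter descriptions above.
DEFAULT_WHITELIST = (",", ";", ":", "|", "\t")
DEFAULT_BLACKLIST = set(string.ascii_letters + string.digits + ".")


def sketch_detect(text, default=None, whitelist=DEFAULT_WHITELIST, blacklist=None):
    """Pick a delimiter by following the three steps described above."""
    extra = set(blacklist or ())
    counts = Counter(text)

    # Step 1: most frequent whitelist character that is present
    # and not explicitly blacklisted by the caller.
    allowed = [c for c in (whitelist or ()) if c in counts and c not in extra]
    if allowed:
        return max(allowed, key=lambda c: counts[c])

    # Step 2: most frequent character outside the blacklist
    # (assumption: caller's blacklist extends the default one).
    banned = DEFAULT_BLACKLIST | extra
    others = [c for c in counts if c not in banned]
    if others:
        return max(others, key=lambda c: counts[c])

    # Step 3: nothing qualified, so return the default (None if not given).
    return default


print(repr(sketch_detect("Gfg,is;best,for,geeks")))  # ',' (three commas, one semicolon)
print(repr(sketch_detect("Gfg is best")))            # ' ' (space is not blacklisted)
print(repr(sketch_detect("Gfg", default="-")))       # '-' (only letters, so default)
```

Counting with Counter keeps each step to a single pass over the candidate characters; ties are broken by whitelist order in step 1 and by first appearance in the text in step 2.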
Example 1: Working with detect() and default
This example detects the delimiter in a few strings and shows how the default parameter is used.
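A minimal sketch of such calls, assuming the detect_delimiter package is installed and that detect() accepts the arguments listed under Syntax. The helper report() and the sample strings are made up for this illustration and are not part of the library. Outputs are not listed, since they depend on the installed version.

```python
# Guard the import so the script still loads when the package is missing.
try:
    from detect_delimiter import detect
except ImportError:
    detect = None


def report(detect_fn, samples, default=None):
    """Return (text, delimiter) pairs for each sample string.

    report() is a hypothetical helper for this example, not a library function.
    """
    return [(text, detect_fn(text, default=default)) for text in samples]


# Illustrative inputs: two delimited strings and one made of letters only,
# for which the default is expected to be the only possible answer.
samples = ["Gfg,is,best,for,geeks", "Gfg;is;best", "Gfg"]

if detect is not None:
    for text, delimiter in report(detect, samples, default="#"):
        print(repr(text), "->", repr(delimiter))
else:
    print("detect_delimiter is not installed; run: pip install detect_delimiter")
```

Passing the detector in as an argument keeps the helper independent of the library, so it can also be tried with any function that has the same call shape.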
Example 2: Using blacklist and whitelist parameters
Providing the whitelist parameter prioritizes a particular delimiter even if it occurs less often than a non-whitelisted character. The blacklist parameter makes detect() ignore a given character.
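A hedged sketch of both parameters in use, again assuming the package is installed and that detect() takes whitelist and blacklist as keyword arguments as shown under Syntax. The compare() helper and the sample text are hypothetical and exist only for this illustration. No outputs are listed because they depend on the installed version.

```python
# Guard the import so the script still loads when the package is missing.
try:
    from detect_delimiter import detect
except ImportError:
    detect = None


def compare(detect_fn, text, **options):
    """Return (result with default settings, result with the given options).

    compare() is a hypothetical helper for this example, not a library function.
    """
    return detect_fn(text), detect_fn(text, **options)


# '$' occurs more often than ',', but only ',' is in the default whitelist.
text = "Gfg$is$best,for$geeks"

if detect is not None:
    # Restrict the whitelist to '$' so that it is the character checked first.
    print(compare(detect, text, whitelist=["$"]))
    # Ask detect() to ignore ',' so that another character has to be chosen.
    print(compare(detect, text, blacklist=[","]))
else:
    print("detect_delimiter is not installed; run: pip install detect_delimiter")
```

Printing the default result next to the customised one makes the effect of each parameter visible in a single line of output.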