Clean Web Scraping Data Using clean-text in Python

  • Difficulty Level : Medium
  • Last Updated : 31 Mar, 2022

If you like to work with APIs or scrape data from websites, you have probably come across stray text, numbers, and keywords mixed in with the data you actually want. Cleaning scraped data to get at the useful content can be complicated and frustrating.

In this article, we are going to explore a Python library called clean-text, which helps you clean scraped data with a single function call instead of long, hand-written code. Let's begin.

Installation

Install the package with the following command:

pip install clean-text

Note: The clean-text package requires Python 3.7 or greater.

Syntax

cleantext.clean(text, {operations})

  • text: the input string to clean
  • operations: keyword arguments, listed below

Different cleantext operations:

The clean function accepts a range of keyword arguments that specify how the raw input text should be cleaned, and it returns the cleaned text as a string. Here is the list of arguments you can use.

  • fix_unicode: fix Unicode errors (True or False)
  • to_ascii: transliterate to the closest ASCII representation (True or False)
  • lower: convert the input to lowercase (True or False)
  • no_line_breaks: remove all line breaks
  • no_urls: replace all URLs with a special token
  • no_emails: replace all email addresses with a special token
  • no_phone_numbers: replace all phone numbers with a special token
  • no_numbers: replace all numbers with a special token
  • no_digits: replace all digits with a special token
  • no_currency_symbols: replace all currency symbols with a special token
  • no_punct: remove all punctuation
  • replace_with_punct="": replace punctuation with the given string
  • replace_with_url="<URL>": replace URLs with the given string
  • replace_with_email="<EMAIL>": replace email addresses with the given string
  • replace_with_phone_number="<PHONE>": replace phone numbers with the given string
  • replace_with_number="<NUMBER>": replace numbers with the given string
  • replace_with_digit="0": replace digits with the given string
  • replace_with_currency_symbol="<CUR>": replace currency symbols with the given string
  • lang="en": language of the text; only English ("en") and German ("de") are supported
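
To make the idea of "replace with a special token" concrete, here is a minimal standard-library sketch of what options such as no_urls and no_emails do conceptually. It is not how clean-text is implemented: the function name replace_tokens and both regular expressions are hypothetical, simplified stand-ins, and the library's own patterns cover far more cases.

```python
import re

# Simplified, hypothetical patterns for illustration only.
# clean-text's own patterns are considerably more thorough.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(text, replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Replace URLs and email addresses with placeholder tokens."""
    # URLs are handled first, then email addresses.
    text = URL_PATTERN.sub(replace_with_url, text)
    text = EMAIL_PATTERN.sub(replace_with_email, text)
    return text


sample = "Mail info@example.com or visit https://example.com/about for details."
print(replace_tokens(sample))
# Mail <EMAIL> or visit <URL> for details.
```

In clean-text itself, the same pairing applies: a replace_with_* value is only used when the matching no_* flag is set to True.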

Code Implementation:

Python3




# import library
from cleantext import clean
 
# input string
text = """
    A bunch of \\u2018new\\u2019 references,
    including [Moana]. »Yóù àré rïght <3!«
    """
 
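# clean the text; each replace_with_* value is only used
# when the matching no_* flag is set to True
print(clean(text=text,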
            fix_unicode=True,
            to_ascii=True,
            lower=True,
            no_line_breaks=False,
            no_urls=False,
            no_emails=False,
            no_phone_numbers=False,
            no_numbers=False,
            no_digits=False,
            no_currency_symbols=False,
            no_punct=False,
            replace_with_punct="",
            replace_with_url="This is a URL",
            replace_with_email="Email",
            replace_with_phone_number="",
            replace_with_number="123",
            replace_with_digit="0",
            replace_with_currency_symbol="$",
            lang="en"
            ))


Output:

 

