HTML Parser in C/C++
An HTML parser, in the simple sense used here, is a program that extracts the useful text from HTML and leaves the tags (such as <h1>, <span>, <p>) behind.
Examples:
Input: <h1>Geeks for Geeks</h1>
Output: Geeks for Geeks
Explanation: <h1> and </h1> are the opening and closing heading tags, so they are stripped, leaving “Geeks for Geeks” as the output.
Input: <p> Geeks for Geeks</p>
Output: Geeks for Geeks
Explanation: <p> and </p> are the opening and closing paragraph tags, so they are stripped. The parser also skips the leading space character, leaving “Geeks for Geeks” as the output.
Approach: Let the input string be S of size N. Follow the steps below to solve the problem:
- Declare two variables, start and end, to hold the indices of the first and last characters of the statement.
- Traverse the string S using the variable i. If S[i] is equal to ‘>’, update start to i+1 and break out of the loop.
- Skip the leading blank spaces by running a loop while S[start] is equal to ‘ ’, incrementing start by 1 in each iteration.
- Traverse the string S again, from start, using the variable i. If S[i] is equal to ‘<’, update end to i-1 and break out of the loop.
- Run a loop and print the characters of the string S in the range [start, end].
Below is a C implementation of the above approach:
// C program for the above approach
#include <stdio.h>
#include <string.h>

// Function to parse the HTML code
void parser(const char* S)
{
    // Store the length of the input string
    int n = (int)strlen(S);

    // [start, end] is an empty range until
    // a closing '<' is found
    int start = 0, end = -1;
    int i, j;

    // Traverse the string
    for (i = 0; i < n; i++) {
        // If S[i] is '>', update start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }

    // Skip the leading blank spaces; the terminating
    // '\0' stops the loop at the end of the string
    while (S[start] == ' ') {
        start++;
    }

    // Traverse the string from start
    for (i = start; i < n; i++) {
        // If S[i] is '<', update end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }

    // Print the characters in the range [start, end]
    for (j = start; j <= end; j++) {
        printf("%c", S[j]);
    }
    printf("\n");
}

// Driver Code
int main()
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1> This is a statement with some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p> ";

    printf("Parsed Statements:\n");

    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);

    return 0;
}
Output:
Parsed Statements:
This is a statement
This is a statement with some spaces
This is a statement with some @ #$ ., / special characters
Below is a C++ implementation of the above approach:
// C++ program for the above approach
#include <cstring>
#include <iostream>
using namespace std;

// Function to parse the HTML code
void parser(const char* S)
{
    // Store the length of the input string
    int n = static_cast<int>(strlen(S));

    // [start, end] is an empty range until
    // a closing '<' is found
    int start = 0, end = -1;

    // Traverse the string
    for (int i = 0; i < n; i++) {
        // If S[i] is '>', update start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }

    // Skip the leading blank spaces; the terminating
    // '\0' stops the loop at the end of the string
    while (S[start] == ' ') {
        start++;
    }

    // Traverse the string from start
    for (int i = start; i < n; i++) {
        // If S[i] is '<', update end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }

    // Print the characters in the range [start, end]
    for (int j = start; j <= end; j++) {
        cout << S[j];
    }
    cout << endl;
}

// Driver Code
int main()
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1> This is a statement with some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p> ";

    cout << "Parsed Statements:\n";

    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);

    return 0;
}
Output:
Parsed Statements:
This is a statement
This is a statement with some spaces
This is a statement with some @ #$ ., / special characters
Time Complexity: O(N)
Auxiliary Space: O(1)
Note: This program parses only one statement at a time.
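The one-statement limit can be lifted by tracking whether the current character lies inside a tag. The following is a minimal sketch, not part of the original program: the function name strip_tags, its parameters, and the one-statement-per-line output format are assumptions made for illustration. It treats everything between '<' and '>' as a tag, so it does not handle comments, attribute values that contain '>', or character entities.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not from the original article):
   copies the text found outside <...> tags from s into out, one
   statement per line. out has room for cap bytes and is always
   NUL-terminated when cap > 0; text that does not fit is dropped.
   Returns the number of statements found. */
size_t strip_tags(const char *s, char *out, size_t cap)
{
    size_t len = 0;   /* characters written to out so far */
    size_t count = 0; /* statements found so far */
    int in_tag = 0;   /* 1 while between '<' and '>' */
    int in_text = 0;  /* 1 once the current statement has started */

    assert(s != NULL && out != NULL);
    if (cap == 0)
        return 0;

    for (; *s != '\0'; s++) {
        if (*s == '<') {        /* a tag starts, so the statement ends */
            in_tag = 1;
            in_text = 0;
        } else if (*s == '>') { /* the tag ends */
            in_tag = 0;
        } else if (!in_tag) {
            if (!in_text) {
                if (*s == ' ')
                    continue;   /* skip leading blanks, as in the approach */
                /* first character of a new statement: separate it
                   from the previous one with a newline */
                if (count > 0 && len + 1 < cap)
                    out[len++] = '\n';
                in_text = 1;
                count++;
            }
            if (len + 1 < cap)  /* keep one byte for the terminator */
                out[len++] = *s;
        }
    }
    out[len] = '\0';
    return count;
}
```

For example, with a 64-byte buffer buf, the call strip_tags("<h1>One</h1><p> Two</p>", buf, sizeof buf) returns 2 and leaves One and Two in buf on separate lines. The sketch still makes a single pass over the input, so the time complexity stays O(N); the only extra storage is the caller's output buffer.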