How to Convert a Document to PDF in Java?

Last Updated: 23 Nov, 2022

Software projects often need to convert a file (HTML, TXT, etc.) to PDF and, in the other direction, to convert a PDF to HTML, TXT, and similar formats. Sometimes PDF pages also need to be stored as images such as PNG or GIF. Let us see how to do this with a sample Maven project. Since it is a Maven project, the necessary dependencies need to be added to pom.xml.

The essential library is Pdf2Dom, used together with Apache PDFBox:

<!-- To load the selected PDF file -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>

<!-- Required for conversion -->
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>

A few more dependencies are also needed. iText is used to extract text from a PDF file and, in the examples below, to create PDFs from HTML, images, and text. POI is needed to create the .docx document.

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>

Example Maven Project

Let us start with pom.xml and then look at the source code that converts PDF to other formats, as well as other formats to PDF.

 

pom.xml

XML




<?xml version="1.0"?>
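<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
                        http://maven.apache.org/xsd/maven-4.0.0.xsd">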
    <modelVersion>4.0.0</modelVersion>
    <artifactId>pdf</artifactId>
    <name>pdf</name>
    <url>http://maven.apache.org</url>
  
    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
  
    <dependencies>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-tools</artifactId>
            <version>${pdfbox-tools.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.sf.cssbox</groupId>
            <artifactId>pdf2dom</artifactId>
            <version>${pdf2dom.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>${itextpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>com.itextpdf.tool</groupId>
            <artifactId>xmlworker</artifactId>
            <version>${xmlworker.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi-scratchpad.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.xmlgraphics</groupId>
            <artifactId>batik-transcoder</artifactId>
            <version>${batik-transcoder.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi-ooxml.version}</version>
        </dependency>
        <dependency>
            <groupId>org.thymeleaf</groupId>
            <artifactId>thymeleaf</artifactId>
            <version>${thymeleaf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf</artifactId>
            <version>${flying-saucer-pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf-openpdf</artifactId>
            <version>${flying-saucer-pdf-openpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-core</artifactId>
            <version>${open-html-pdf-core.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-pdfbox</artifactId>
            <version>${open-html-pdfbox.version}</version>
        </dependency>
    </dependencies>
  
    <build>
        <finalName>pdf</finalName>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
    </build>
  
    <properties>
        <pdfbox-tools.version>2.0.25</pdfbox-tools.version>
        <pdf2dom.version>2.0.1</pdf2dom.version>
        <itextpdf.version>5.5.10</itextpdf.version>
        <xmlworker.version>5.5.10</xmlworker.version>
        <poi-scratchpad.version>3.15</poi-scratchpad.version>
        <batik-transcoder.version>1.8</batik-transcoder.version>
        <poi-ooxml.version>3.15</poi-ooxml.version>
        <thymeleaf.version>3.0.11.RELEASE</thymeleaf.version>
        <flying-saucer-pdf.version>9.1.20</flying-saucer-pdf.version>
        <open-html-pdfbox.version>1.0.6</open-html-pdfbox.version>
        <open-html-pdf-core.version>1.0.6</open-html-pdf-core.version>
        <flying-saucer-pdf-openpdf.version>9.1.22</flying-saucer-pdf-openpdf.version>
        <jsoup.version>1.14.2</jsoup.version>
    </properties>
  
</project>


Let us look at the key source files.
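All the examples below write their results into src/output. FileOutputStream and PrintWriter do not create missing parent directories, so the folder has to exist before any conversion runs. A minimal sketch of creating it up front, using only the JDK (the class and method names here are illustrative and not part of the project):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the article's project.
final class OutputDirectory {

    // Creates the directory and any missing parents, then returns its path.
    // If the directory already exists, nothing is changed.
    static Path ensure(String dir) throws IOException {
        return Files.createDirectories(Paths.get(dir));
    }

    public static void main(String[] args) throws IOException {
        Path out = ensure("src/output");
        System.out.println("Converted files will be written to " + out.toAbsolutePath());
    }
}
```

Calling OutputDirectory.ensure("src/output") at the start of each main method would be one way to avoid a FileNotFoundException on the first run.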

1. PDF and HTML conversion

ConversionOfPDF2HTMLExample.java

The program below handles both directions:

a. generationOfHTMLFromPDF

Note: PDF-to-HTML conversion is not guaranteed to be pixel-perfect. The more complex the PDF file, the less accurate the result.

b. generationOfPDFFromHTML

Note: All tags in the HTML file must be properly closed; otherwise the PDF cannot be generated.

Java




import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;
  
import javax.xml.parsers.ParserConfigurationException;
  
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
  
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
  
public class ConversionOfPDF2HTMLExample {
  
    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String HTML = "src/main/resources/html.html";
  
    public static void main(String[] args) {
        try {
            generationOfHTMLFromPDF(PDF);
            generationOfPDFFromHTML(HTML);
        } catch (IOException | ParserConfigurationException | DocumentException e) {
            e.printStackTrace();
        }
    }
  
    private static void generationOfHTMLFromPDF(String filename) throws ParserConfigurationException, IOException {
        // try-with-resources closes the PDF and the writer even if the conversion fails
        try (PDDocument pdf = PDDocument.load(new File(filename));
             Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
            PDFDomTree parser = new PDFDomTree();
            parser.writeText(pdf, output);
        }
    }
  
    private static void generationOfPDFFromHTML(String filename) throws IOException, DocumentException {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("src/output/html.pdf"));
        document.open();
        // Close the HTML input stream once parsing is done
        try (FileInputStream html = new FileInputStream(filename)) {
            XMLWorkerHelper.getInstance().parseXHtml(writer, document, html);
        }
        document.close();
    }
}
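Because generationOfPDFFromHTML needs every tag to be properly closed, it can help to check the HTML before handing it to iText. The sketch below uses the JDK's own XML parser as a stand-in check. It is not what iText does internally, and WellFormedCheck is a hypothetical helper name. It also assumes the markup does not rely on DTD-defined entities such as &nbsp;.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the article's project.
final class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, i.e. every tag is closed
    // and properly nested. Returns false on any parse or setup error.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Do not download an external DTD when the file carries a DOCTYPE
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler keeps parse errors off stderr; fatal errors are still thrown
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>Hello</p><br/></body></html>"));
        System.out.println(isWellFormed("<html><body><p>Hello<br></body></html>"));
    }
}
```

A file that fails this check usually just needs its void elements self-closed (for example `<br/>`) and any unclosed tags completed.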


2. PDF and Image Conversions

A PDF can be converted to images in many ways; one important option is Apache PDFBox. In the other direction, an image can be converted to PDF using iText.

ConversionOfPDF2ImageExample.java

The program below handles the following methods:

  • generationOfPDFFromImage
    • Images can be of type jpeg, jpg, gif, tiff, or png and can be loaded from a URL, as in this example, or from disk.
  • generationOfImageFromPDF
    • Apache PDFBox is an advanced tool. Each page of the PDF is rendered as a BufferedImage using PDFRenderer, and ImageIOUtil then writes the image in a format such as JPEG, GIF, or PNG.

Java




import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
  
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;
  
import com.itextpdf.text.BadElementException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
  
public class ConversionOfPDF2ImageExample {
  
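    private static final String PDF = "src/main/resources/pdf.pdf";
    // Image URLs are given without the file extension; generationOfPDFFromImage appends it.
    // JPG is a placeholder, not a real image: replace it with the URL of a JPEG (minus ".jpg").
    private static final String JPG = "https://example.com/path/to/image";
    private static final String GIF = "https://media.giphy.com/media/l3V0x6kdXUW9M4ONq/giphy";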
  
    public static void main(String[] args) {
        try {
            generationOfImageFromPDF(PDF, "png");
            generationOfImageFromPDF(PDF, "jpeg");
            generationOfImageFromPDF(PDF, "gif");
            generationOfPDFFromImage(JPG, "jpg");
            generationOfPDFFromImage(GIF, "gif");
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }
  
    private static void generationOfImageFromPDF(String filename, String extension) throws IOException {
        // try-with-resources closes the document even if rendering fails
        try (PDDocument document = PDDocument.load(new File(filename))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                // Render each page at 300 DPI and write it as pdf-<page number>.<extension>
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
                ImageIOUtil.writeImage(bim, String.format("src/output/pdf-%d.%s", page + 1, extension), 300);
            }
        }
    }
  
    private static void generationOfPDFFromImage(String filename, String extension)
            throws IOException, BadElementException, DocumentException {
        Document document = new Document();
        String input = filename + "." + extension;
        String output = "src/output/" + extension + ".pdf";
        FileOutputStream fos = new FileOutputStream(output);
        PdfWriter writer = PdfWriter.getInstance(document, fos);
        writer.open();
        document.open();
        document.add(Image.getInstance((new URL(input))));
        document.close();
        writer.close();
    }
  
}
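The value 300 passed to renderImageWithDPI decides how large each rendered image is. PDF page sizes are measured in points, with 72 points per inch, so the pixel size is roughly the page size multiplied by dpi / 72. The sketch below works through that arithmetic with the JDK only. RenderSize is a hypothetical helper, and the memory figure assumes 4 bytes per pixel, as in an int-packed RGB BufferedImage.

```java
// Hypothetical helper, not part of the article's project.
final class RenderSize {

    private static final double POINTS_PER_INCH = 72.0;

    // Approximate length in pixels of a page edge given in PDF points.
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Approximate size in bytes of the uncompressed raster, assuming 4 bytes per pixel.
    static long rasterBytes(double widthPt, double heightPt, double dpi) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * 4L;
    }

    public static void main(String[] args) {
        // An A4 page is about 595 x 842 points.
        System.out.println(pixels(595, 300) + " x " + pixels(842, 300) + " pixels");
        System.out.println(rasterBytes(595, 842, 300) / (1024 * 1024) + " MiB per page (approx.)");
    }
}
```

For large documents, lowering the DPI (for example to 150) cuts the pixel count, and with it the memory needed per page, to about a quarter.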


3. PDF and Text Conversions

Here too, Apache PDFBox is used to extract the text from PDF files, and iText is used for text-to-PDF conversion.

Note: Formatting cannot be preserved in a plain text file, since it holds text only.

ConversionOfPDF2TextExample.java

Java




import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
  
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
  
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Font;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
  
public class ConversionOfPDF2TextExample {
  
    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String TXT = "src/main/resources/txt.txt";
  
    public static void main(String[] args) {
        try {
            generationOfTxtFromPDF(PDF);
            generationOfPDFFromTxt(TXT);
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }
  
    private static void generationOfTxtFromPDF(String filename) throws IOException {
        File f = new File(filename);
        String parsedText;
        PDFParser parser = new PDFParser(new RandomAccessFile(f, "r"));
        parser.parse();
  
        // Closing the PDDocument also closes the COSDocument it wraps
        try (PDDocument pdDoc = new PDDocument(parser.getDocument())) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            parsedText = pdfStripper.getText(pdDoc);
        }
  
        // Write the extracted text; the writer is closed even if printing fails
        try (PrintWriter pw = new PrintWriter("src/output/pdf.txt")) {
            pw.print(parsedText);
        }
    }
  
    private static void generationOfPDFFromTxt(String filename) throws IOException, DocumentException {
        Document pdfDoc = new Document(PageSize.A4);
        PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf"))
                .setPdfVersion(PdfWriter.PDF_VERSION_1_7);
        pdfDoc.open();
          
        Font myfont = new Font();
        myfont.setStyle(Font.NORMAL);
        myfont.setSize(11);
        pdfDoc.add(new Paragraph("\n"));
          
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String strLine;
        while ((strLine = br.readLine()) != null) {
            Paragraph para = new Paragraph(strLine + "\n", myfont);
            para.setAlignment(Element.ALIGN_JUSTIFIED);
            pdfDoc.add(para);
        }
          
        pdfDoc.close();
        br.close();
    }
  
}
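generationOfPDFFromTxt reads the input with FileReader, which uses the platform default charset. On older JDKs that default depends on the operating system, so non-ASCII text may be read incorrectly. A minimal sketch of reading the lines with an explicit charset, assuming the text file is UTF-8 (Utf8Lines is a hypothetical helper name):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's project.
final class Utf8Lines {

    // Reads all lines of a text file as UTF-8, regardless of the platform default charset.
    static List<String> read(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Each returned line could then be wrapped in a Paragraph exactly as in the while loop above.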


4. PDF and DocX Conversions

Two libraries are needed:

  • iText: to extract the text from the PDF
  • POI: to create the .docx document

ConversionOfPDF2WordExample.java

Java




import java.io.FileOutputStream;
import java.io.IOException;
  
import org.apache.poi.xwpf.usermodel.BreakType;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
  
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
  
public class ConversionOfPDF2WordExample {
  
    private static final String FILENAME = "src/main/resources/pdf.pdf";
  
    public static void main(String[] args) {
        try {
            generationOfDocFromPDF(FILENAME);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
  
    private static void generationOfDocFromPDF(String filename) throws IOException {
        XWPFDocument doc = new XWPFDocument();
  
        String pdf = filename;
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
  
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String text = strategy.getResultantText();
            XWPFParagraph p = doc.createParagraph();
            XWPFRun run = p.createRun();
            run.setText(text);
            run.addBreak(BreakType.PAGE);
        }
        FileOutputStream out = new FileOutputStream("src/output/pdf.docx");
        doc.write(out);
        out.close();
        reader.close();
        doc.close();
    }
}
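The code above puts the whole text of a page into a single run. Line breaks inside a string passed to setText are generally not shown as line breaks in Word, so the page may appear as one long paragraph. One possible refinement, sketched here with the JDK only, is to split the extracted text into lines first and create one paragraph per line. PageLines is a hypothetical helper, and dropping blank lines is an assumption about the desired output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's project.
final class PageLines {

    // Splits extracted page text on any line terminator and drops blank lines.
    static List<String> split(String pageText) {
        List<String> lines = new ArrayList<>();
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Each element would become its own XWPFParagraph in the loop above.
        System.out.println(split("First line\r\nSecond line\n\nThird line"));
    }
}
```

In generationOfDocFromPDF, the loop body could iterate over PageLines.split(text), call doc.createParagraph() for each line, and add the page break only after the last line of the page.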



Conclusion

At many stages of a software project, text, HTML, and images need to be converted to PDF, and PDF data needs to be converted to text, image, and DOCX formats. The examples above show one way to do this in Java.

