How to extract specific text by colour using iTextsharp?

PDF files are a great way to ensure the document you created keeps its formatting and attributes the way you intended, regardless of the machine on which the file is opened. That said, the same property also makes PDF files a bit difficult to edit.

Things become especially problematic when you work with PDF manipulation in code. Luckily, libraries like iTextSharp help developers create, edit, inspect and maintain PDF documents. 

In this article, we’re talking about how you can extract specific text by colour using iTextSharp. 

Extracting text based on colour

iTextSharp has no single call that extracts text directly by its highlight or font colour. That said, you can use an ExtractText() method that also returns the formatting details of each piece of text, then compare each piece's font colour against a reference colour to keep only what you want.

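A simple script to do so would look like this. Note that the PdfLoadedDocument, PdfPageBase and TextData types used here appear to come from Syncfusion's PDF library rather than iTextSharp, so that package needs to be referenced for the code to compile.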

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
//Assumption: PdfLoadedDocument, PdfPageBase and TextData are Syncfusion PDF types
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

//The members below belong to the Form1 class of a Windows Forms project
PdfLoadedDocument pdf;

private void Form1_Load(object sender, System.EventArgs e)
{
            //Loads the PDF document 
            pdf = new PdfLoadedDocument(@"link/to/file.pdf");

            //Enter colour name here
            textBox1.Text = "Blue";
}
 
private void button1_Click(object sender, EventArgs e)
{
            //Receives the formatting details (text, font colour, ...) of each extracted chunk
            List<TextData> TextFormat = new List<TextData>();
 
            string text = null;
            //Convert the colour string into an actual colour value
            Color color = Color.FromName(textBox1.Text);
            //Color.FromName does not throw on unknown names, so check that the name was recognised
            if (!color.IsKnownColor)
            {
                MessageBox.Show("Enter valid colour name");
                return;
            }
 
            for (int i = 0; i < pdf.Pages.Count; i++)
            {
                //Load PDF page
                PdfPageBase page = pdf.Pages[i];
 
                //Extract the page text along with its formatting attributes
                string pageTexts = page.ExtractText(out TextFormat);
 
                for (int j = 0; j < TextFormat.Count; j++)
                {
                    //Check for target colour (exact ARGB match)
                    if (TextFormat[j].FontColor.ToArgb() == color.ToArgb())
                    {
                        //Append the matching text to the result
                        text += TextFormat[j].Text;
                    }
                }
            }
            if (text != null)
                MessageBox.Show(text);
            else
                MessageBox.Show("The document doesn't have any " + textBox1.Text + " coloured text");
}
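
The script above only matches text whose colour is exactly equal to the named colour, so text that is, say, a slightly darker blue would be skipped. As a minimal sketch of one way around that, the helper below compares colours with a per-channel tolerance and also wraps the colour-name check. The class name ColourMatch, the method names IsClose and TryParseName, and the tolerance parameter are all hypothetical names introduced for illustration; only System.Drawing.Color comes from the original script.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of any PDF library.
public static class ColourMatch
{
    // True when every RGB channel of 'a' is within 'tolerance' of the
    // corresponding channel of 'b'. The alpha channel is ignored.
    public static bool IsClose(Color a, Color b, int tolerance)
    {
        return Math.Abs(a.R - b.R) <= tolerance
            && Math.Abs(a.G - b.G) <= tolerance
            && Math.Abs(a.B - b.B) <= tolerance;
    }

    // Color.FromName returns an empty-valued colour for unknown names
    // instead of throwing, so IsKnownColor tells us whether it was recognised.
    public static bool TryParseName(string name, out Color color)
    {
        color = Color.FromName(name);
        return color.IsKnownColor;
    }
}
```

In the loop, the exact comparison could then be swapped for something like ColourMatch.IsClose(TextFormat[j].FontColor, color, 10), where 10 is an arbitrary tolerance you would tune for your documents.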

Just as is the case with everything in coding, there are different methods to achieve the same result using different libraries or approaches. If you’re starting with iTextSharp, this is probably the simplest one to understand.
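
As one such alternative, here is a rough sketch of how the same colour filter might be written against iTextSharp's own parser classes. It assumes an iTextSharp 5.x release in which TextRenderInfo exposes GetFillColor(), and it compares only the RGB channels of the fill colour. The class names ColourFilterStrategy and ColourTextExtractor are hypothetical and were made up for this example.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: keeps only text chunks whose fill colour matches the target.
public class ColourFilterStrategy : ITextExtractionStrategy
{
    private readonly BaseColor target;
    private readonly StringBuilder result = new StringBuilder();

    public ColourFilterStrategy(BaseColor target)
    {
        this.target = target;
    }

    // Called by the parser for every chunk of text it renders.
    public void RenderText(TextRenderInfo renderInfo)
    {
        BaseColor fill = renderInfo.GetFillColor();
        // The fill colour may be missing, so guard against null.
        if (fill != null && fill.R == target.R && fill.G == target.G && fill.B == target.B)
        {
            result.Append(renderInfo.GetText());
        }
    }

    public string GetResultantText()
    {
        return result.ToString();
    }

    // Unused callbacks required by the interface.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

public static class ColourTextExtractor
{
    // Runs the filter over every page (iTextSharp pages are numbered from 1).
    public static string Extract(string path, BaseColor target)
    {
        StringBuilder text = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(
                    reader, page, new ColourFilterStrategy(target)));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

A call such as ColourTextExtractor.Extract(@"link/to/file.pdf", BaseColor.BLUE) would then return the concatenated blue text, or an empty string if none is found.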
