Sunday, June 30, 2024

Table Extraction from PDFs using Multimodal (Vision) LLMs

A couple of weeks ago, a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case for the recent multimodal Large Language Models (LLMs). Multimodal LLMs accept not only text inputs via their prompt, like earlier LLMs, but also non-text modalities such as images and audio. Some examples of multimodal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude 3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's cloud computing platform. We did not win; there were other entries that were better than ours, both in terms of the originality of their idea and the quality of their implementations. However, we learned some cool new things during the hackathon, and figured that these might be of general interest to others as well, hence this post.

Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data to search within tables, or convert it to Markdown so that downstream LLMs could query it easily via their text interface. We had originally intended to extend the idea to figures and charts as well, but we could not get that pipeline working end to end.

Here is what our pipeline looked like.

  1. Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
  2. We then send each page image through the Table Transformer, which returns bounding box information for each table it detects in the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection.
  3. We crop out each table from the page using the bounding box information, and then send each table image to GPT-4o as part of a prompt asking it to convert the table to a JSON structure. GPT-4o responds with a JSON representation of the table. (A code sketch of the splitting, detection, and cropping steps follows this list.)
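
Here is a minimal sketch of the first part of the pipeline: rendering each PDF page to an image with PyMuPDF, detecting tables on each page with the microsoft/table-transformer-detection model via the Hugging Face object-detection pipeline, and cropping out the detected tables. The file names, DPI, and confidence threshold are illustrative choices, not necessarily what we used in the hackathon.

```python
import fitz  # PyMuPDF
from PIL import Image
from transformers import pipeline

# Step 1: render each PDF page to a PNG image
doc = fitz.open("paper.pdf")
page_paths = []
for page_num, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)          # rasterize the page
    path = f"page_{page_num:03d}.png"
    pix.save(path)
    page_paths.append(path)

# Step 2: detect tables on each page with the pre-trained Table Transformer
detector = pipeline("object-detection",
                    model="microsoft/table-transformer-detection")

# Step 3 (first half): crop each detected table out of its page image
table_paths = []
for path in page_paths:
    image = Image.open(path).convert("RGB")
    for i, det in enumerate(detector(image)):
        if det["score"] < 0.9:              # illustrative confidence cutoff
            continue
        box = det["box"]                    # xmin / ymin / xmax / ymax in pixels
        crop = image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
        crop_path = path.replace(".png", f"_table_{i}.png")
        crop.save(crop_path)
        table_paths.append(crop_path)
```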

This pipeline was based on my colleague's idea. I like how it progressively simplifies the task: it splits each page of the incoming PDF into its own image, then uses a pre-trained Table Transformer to crop out the tables from those pages, and only then passes each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}". The Table Transformer, while not perfect, proved remarkably successful at identifying tables in the pages. I say remarkable because we used a pre-trained model, but perhaps it is not that remarkable once you consider that it was probably trained on tables from academic papers as well.
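
Building the data URL takes only a few lines; something like this (the helper name is mine, and I am assuming the crops are saved as PNGs):

```python
import base64

def to_data_url(image_path: str, mime_type: str = "image/png") -> str:
    """Encode an image file as a data URL for use in a multimodal prompt."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{b64}"
```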

Our prompt for GPT-4o looked something like this:

System: You are an AI model that specializes in detecting tables and extracting and interpreting table content from images. Follow the instructions below step by step:
1. Recognize whether the given image is a table or not. If it is not a table, print "None". If it is a table, go to the next step.
2. Accurately convert the table's content into a structured JSON format.

General instructions:
1. Do not output anything extra.
2. A table must contain rows and columns.

User: Given the image, detect whether it is a table or not; if it is a table, convert it to JSON format.
{image_data_url}
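
And here is roughly how such a prompt gets sent to GPT-4o on Azure using the openai Python SDK. The endpoint, API version, and deployment name are placeholders for whatever your Azure OpenAI resource is configured with, and to_data_url is the small helper sketched earlier; treat this as an approximation of our call rather than the exact code we ran.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",       # placeholder API version
)

SYSTEM_PROMPT = "..."  # the system prompt shown above, elided here for brevity

def table_image_to_json(image_path: str) -> str:
    """Send a cropped table image to GPT-4o and return its response text."""
    response = client.chat.completions.create(
        model="gpt-4o",                      # the Azure deployment name -- a placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text",
                 "text": "Given the image, detect whether it is a table or not; "
                         "if it is a table, convert it to JSON format."},
                {"type": "image_url",
                 "image_url": {"url": to_data_url(image_path)}},
            ]},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```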

For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open World Localization) model in place of the Table Transformer, but it was not as successful at detecting figures in the papers, probably because it seems to have been fine-tuned to detect objects in natural images. Unfortunately, we couldn't find a pre-trained model that would work for this particular case. Another issue was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text instead.

One suggestion from some of my non-work TWIML colleagues was to ask GPT-4o to return the bounding boxes for the figures it finds in the page, and then use those to crop out the figures and send them back to GPT-4o for description. It didn't work unfortunately, but it was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM, or at least to verify that it can't do something before moving on to older (and harder to implement) solutions.
