Sunday, June 30, 2024

Table Extraction from PDFs using Multimodal (Vision) LLMs

A couple of weeks ago, a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs accept not only text inputs via their prompt, like earlier LLMs, but also non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude-3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's Cloud Computing Platform. We did not win; there were other entries that were better than ours, both in the originality of their ideas and the quality of their implementations. However, we learned some cool new things during the hackathon, and figured that these might be of general interest to others as well, hence this post.

Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data to search within tables, or convert it to Markdown so that downstream LLMs could consume it easily via their text interface. We had originally intended to extend the idea to figures and charts, but we could not get that pipeline working end to end.

Here is what our pipeline looked like.

  1. Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
  2. We then send each page image through the Table Transformer, which returns bounding box information for each table it detects in the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection.
  3. We crop out each table from the pages using the bounding box information, and then send each table to GPT-4o as part of a prompt asking it to convert the table to a JSON structure. GPT-4o responds with a JSON structure representing the table. (A code sketch of the page-splitting and table-detection steps follows this list.)
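
Here is a minimal sketch of the first two steps, assuming PyMuPDF and the Hugging Face transformers library; the DPI and confidence threshold below are illustrative choices, not necessarily the values we used.

    import fitz  # PyMuPDF
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    def pdf_to_page_images(pdf_path, dpi=150):
        """Render each page of the PDF to a PIL image (150 DPI is an illustrative choice)."""
        doc = fitz.open(pdf_path)
        pages = []
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            pages.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
        return pages

    # the pre-trained table detection model mentioned in the post
    processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
    model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

    def crop_tables(page_image, threshold=0.9):
        """Detect tables on a page image and return the cropped table images."""
        inputs = processor(images=page_image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
        detections = processor.post_process_object_detection(
            outputs, threshold=threshold, target_sizes=target_sizes)[0]
        return [page_image.crop(box.tolist()) for box in detections["boxes"]]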

This pipeline was based on my colleague's idea. I like how it progressively simplifies the task by splitting each page of the incoming PDF into its own image, then using a pre-trained Table Transformer to crop out the tables from them, and only then passing each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}". The Table Transformer, while not perfect, proved remarkably successful at identifying tables in the text. I say remarkable because we used a pre-trained model, but perhaps it is not that remarkable once you consider that it was probably trained on tables in academic papers as well.
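
Constructing the data URL from a cropped table image takes only a few lines; assuming the crop is saved as a PNG, it might look like this:

    import base64
    import io

    def to_data_url(table_image, mime_type="image/png"):
        """Encode a PIL image as a data URL suitable for a multi-modal prompt."""
        buf = io.BytesIO()
        table_image.save(buf, format="PNG")
        encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
        return f"data:{mime_type};base64,{encoded}"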

Our prompt for GPT-4o looked something like this:

System: You are an AI model that specializes in detecting tables and extracting and interpreting table content from images. Follow the instructions below step by step:
1. Recognize whether the given image is a table or not. If it is not a table, print "None". If it is a table, go to the next step.
2. Accurately convert the table's content into a structured JSON format.

General instructions:
1. Do not output anything extra.
2. A table must contain rows and columns.

User: Given the image, detect whether it is a table or not. If it is a table, convert it to JSON format.
{image_data_url}
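
Wiring the prompt and the table's data URL into a GPT-4o call on Azure might look roughly like the sketch below; the API version, environment variable names and deployment name are placeholders, not the actual values we used.

    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder env vars
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-15-preview",  # illustrative API version
    )

    def table_image_to_json(image_data_url, system_prompt, user_prompt, deployment="gpt-4o"):
        """Send the system/user prompts plus the table image to GPT-4o and return its reply."""
        response = client.chat.completions.create(
            model=deployment,  # Azure deployment name, a placeholder here
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": user_prompt},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ]},
            ],
            temperature=0.0,
        )
        return response.choices[0].message.content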

For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open-World Localization) model in place of the Table Transformer, but it was not as successful at detecting figures in the text, probably because it seems to be fine-tuned to detect objects in natural images. Unfortunately, we couldn't find a pre-trained model that would work for this particular case. Another issue was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text instead.

One suggestion from some of my TWIML non-work colleagues was to ask GPT-4o to return the bounding boxes for the figures it finds in the page, and then use those to extract the figures to send back to GPT-4o for describing. It didn't work unfortunately, but was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM. Or at least verify that it can't do something before moving on to older (and harder to implement) solutions.

Sunday, June 23, 2024

Book Report: Pandas Workout

Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came upon this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when Pandas first hit the scene, I thought it was a nice library, but I just didn't see the point of spending time to learn another interface for doing the same things I could already do. Of course, Pandas has matured since then (and so have I, hopefully), and when faced with a data analysis / preparation / cleanup task, I now often reach not only for Pandas but, depending on the task, also for its various incarnations such as PySpark, Dask Dataframes and RAPIDS cuDF. When I use Pandas (and its various incarnations) I often find myself depending heavily on Stack Overflow (and lately Github Copilot) for things I know can be done but not how. To some extent I blame this on never having spent the time to understand Pandas in depth. So when I was offered the chance to review Pandas Workout by Reuven Lerner, I welcomed it as a way to remedy this gap in my knowledge.

The book is about Pandas fundamentals rather than solving specific problems with Pandas. For that you will still want to look up Stack Overflow :-). In fact, in the foreword the author specifically targets my demographic (people who need to look up Stack Overflow when solving problems with Pandas). But he promises that after reading the book you will understand why some solutions are better than others.

Pandas started as an open source project by Wes McKinney, and has grown somewhat organically into the top Data Science toolkit that it is today. As a result, there are often multiple ways to do something in Pandas. While all these ways may produce identical results, their performance characteristics may be different, so there is usually an implicit "right" way. The book gives you the mental model to decide which among the different approaches is the "right" one.
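
As a small illustration of that point (my example, not one from the book): both statements below produce the same column, but the vectorized version avoids calling a Python function once per row and is typically orders of magnitude faster on large DataFrames.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price": np.random.rand(1_000_000) * 100,
        "qty": np.random.randint(1, 10, size=1_000_000),
    })

    # Row-wise: invokes a Python lambda for every row
    df["total_slow"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

    # Vectorized: a single operation over whole columns
    df["total_fast"] = df["price"] * df["qty"]

    assert np.allclose(df["total_slow"], df["total_fast"])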

The book is organized into the following chapters. Each chapter covers a particular aspect of Pandas usage. I have included a super-short TLDR style abstract for each chapter for your convenience.

  1. Series -- Pandas Series objects are the basic building blocks of Pandas: they represent a typed sequence of data and are used to construct DataFrames and Indexes. Many methods on the Series object apply in a similar way to DataFrames as well. This is a foundational chapter; understanding it will help with future chapters.
  2. Data Frames -- DataFrames represent tabular data as a sequence of Series, where each Series object represents a column in the table. Pandas inherits the idea of DataFrames from R, and the incarnations I listed (and a few that I didn't) use the DataFrame as a basic abstraction as well. This chapter teaches you how to select from and manipulate DataFrames. Unless you've used Pandas extensively before, there is a high chance you will learn some useful new tricks here (I did, several of them).
  3. Import and Export -- covers reading and writing CSV and JSON formats to and from DataFrames. Covers some simple sanity checks you can run to verify that the import or export worked correctly. I learned about the pd.read_html method here, probably not that useful, but interesting to know!
  4. Indexes -- Indexes are used by Pandas to efficiently find data in DataFrames. While it may be possible to get by without Indexes, your Pandas code would take longer to run and consume more resources. The chapter deals with indexing techniques. I happened to know a lot of them, but there were a few that I didn't, especially the techniques around pivot tables.
  5. Cleaning -- this chapter teaches a skill that is very fundamental to (and maybe even the bane of) a Data Scientist's job; the oft-quoted statistic is that we spend 80% of our time cleaning data. Along with the techniques themselves (remove / interpolate / ignore), this chapter contains commentary that will help you frame these decisions on your own data cleaning tasks.
  6. Grouping, Joining and Sorting -- these three operations are so central to data analysis that SQL has a special keyword for each of them (JOIN, GROUP BY and ORDER BY). This chapter covers various recipes to do these operations efficiently and correctly in Pandas.
  7. Advanced Grouping, Joining and Sorting -- this chapter goes into greater detail on how to combine these operations to deal with specific use cases via the so-called "split-apply-combine" technique, including the general aggregation function agg. It also shows how to do method chaining using assign (there is a short sketch of both after this list).
  8. Midway Project -- describes a project and asks questions that you should be able to answer from the data using the techniques you have learned so far. Comes with solutions.
  9. Strings -- one reason I don't have much experience with Pandas is that it is, for the most part, focused on numeric tables. However, Pandas also has impressive string handling facilities via the str accessor. This chapter was something of an eye-opener for me, showing me how to use Pandas for text analysis and pre-processing.
  10. Dates -- this chapter describes Pandas date and time handling capabilities. This can be useful when trying to work with time series or when trying to derive numerical features from columns containing datetime objects to combine with other numeric or text data.
  11. Visualizations -- this chapter describes visualization functionality you can invoke from within Pandas, powered by either Matplotlib or Seaborn. This is more convenient than exporting the data to Numpy and using the two packages to draw the charts.
  12. Performance -- performance has been a focus for most of the preceding chapters in this book. However, the recipes in this chapter are in the advanced-tricks category, and include converting strings to categorical values, optimizing reads and writes using Apache Arrow backed formats, and using fast special-purpose functions for specific tasks.
  13. Final Project -- describes a project similar to the Midway project with questions that you should be able to answer from the data using the techniques you have learned so far.
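
To make a couple of the chapter topics concrete, here is a tiny split-apply-combine sketch on made-up data (the columns and values are mine, not from the book), using assign for method chaining and agg for per-group aggregates.

    import pandas as pd

    rides = pd.DataFrame({
        "borough": ["Manhattan", "Queens", "Manhattan", "Brooklyn", "Queens"],
        "distance_km": [2.1, 8.4, 3.3, 5.0, 7.2],
        "fare": [9.5, 27.0, 12.0, 18.5, 24.0],
    })

    summary = (
        rides
        .assign(fare_per_km=lambda d: d["fare"] / d["distance_km"])  # method chaining with assign
        .groupby("borough")                                          # split
        .agg(trips=("fare", "size"),                                 # apply per-group aggregates
             mean_fare=("fare", "mean"),
             mean_fare_per_km=("fare_per_km", "mean"))
        .sort_values("mean_fare", ascending=False)                   # combine and order the result
    )
    print(summary)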

I think the book has value beyond just teaching Pandas fundamentals though. The author sprinkles insights about Data Analysis and Data Science throughout the book: learning to structure the problem, planning the sequence of steps best suited to the tools at hand, the importance of critical thinking, the importance of knowing the data and interpreting the results of the analysis, etc.

Each exercise (there are 50 in all) involves downloading some dataset, covering subjects as diverse as tourism, taxi rides, SAT scores, parking tickets, the Olympic games, oil prices, etc. I think the information about the availability of such datasets (and possibly related datasets) can also be very valuable to Data Scientists for their future projects.

I think the popularity of Pandas stems from the same reason as the popularity of Jupyter Notebooks. It is a nice, self-contained platform that allows a Data Scientist to demonstrate a series of data transformations from problem to solution in a clear, concise and standard manner, not only to customers, but to other Data Scientists as well. More than any other reason, I feel that this will continue to drive the popularity of Pandas and its various incarnations, and as a Data Scientist, it makes sense to learn how to use it properly. And the book definitely fulfils its promise of teaching you how to do that.