Saturday, February 18, 2023

Resurrection

2022 has come and gone without a single blog post from my end. To be fair, my blogging output has been steadily decreasing over the last few years, so you would be justified in thinking of this as a somewhat inevitable trend. In other words, we had a good run, etc. Thinking back, one possible reason for my decreasing output is that my previous job was more product focused and my current one is more research focused -- almost everything I do, experimental or otherwise, could be of potential use to the company, so it becomes hard to find material to blog about. Over time, I found myself struggling to maintain a consistent posting schedule, and the blog eventually fell by the wayside.

However, looking back, I see that there were things I could have written about but was just too lazy (or not motivated enough) to, and for that I apologize and promise to do better going forward. Those of you who have read my blog before know that I primarily did search, with an interest in many search adjacent systems such as Natural Language Processing (NLP), Rule Engines, and even some Machine Learning (ML). Over the last few years, I have moved more and more towards ML, especially Deep Learning (DL) techniques, mostly for NLP and some Computer Vision (CV). Lately, though, with the growing application of vector based models (including Large Language Models (LLMs) and the older, smaller Transformer models) in search, I feel things have come full circle to some extent, and I find myself looking again at search and search adjacent platforms.

Recently an old friend from school got back in touch with me, and we caught each other up, over the next 10 minutes or so, on what had happened in our lives since we last saw each other 40+ years ago. I thought that was quite efficient, so I am going to try the same strategy here -- basically restart our conversation with a laundry list of stuff I did that I think I can share and that you might find interesting. So here goes.

First, my amazing co-authors, Antonio Gulli and Amita Kapoor, and I, with the help of the editorial team at Packt Publishing, published the third edition of our Deep Learning with TensorFlow and Keras book. The second edition was released right around the time TensorFlow 2.x came out, and the code examples have been updated to reflect that TF 2.x is now a mature, de-facto standard. In addition, there are numerous updates to previously existing chapters in light of progress in Deep Learning. Transformers, Autoencoders, Generative Models (GANs) and Reinforcement Learning now have their own dedicated chapters, and there are new chapters on Unsupervised Learning, Self-supervised Learning, Probabilistic TensorFlow, AutoML and Graph Neural Networks (with the Deep Graph Library (DGL)). There are also bonus chapters on DL math, TPU handling, and ML Best Practices.

I also published a three part liveProject series with Manning on Machine Learning on Graphs for NLP, which explores graph theoretic techniques, both traditional and DL based, for analyzing a corpus of text. While editorial support from both Manning and Packt was excellent, I found Manning's support to be more structured and comprehensive compared to Packt's, although Manning's templates are also more rigid because their process is more automated. I think both approaches have their strengths, and quite honestly I don't know if I prefer one absolutely over the other.
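To give a flavor of the "traditional" end of that spectrum, here is a minimal, purely illustrative sketch (not taken from the liveProjects) that builds a word co-occurrence graph over a toy corpus with networkx and ranks terms with PageRank; the corpus and the whitespace tokenization are obviously stand-ins.

```python
# Illustrative only: build a word co-occurrence graph from a toy corpus and
# rank terms with PageRank, a "traditional" graph technique for text analysis.
from collections import defaultdict
from itertools import combinations

import networkx as nx

docs = [
    "graph neural networks learn node representations",
    "node representations help link prediction",
    "link prediction and node classification are graph tasks",
]

# Co-occurrence: connect every pair of words appearing in the same sentence.
weights = defaultdict(int)
for doc in docs:
    tokens = sorted(set(doc.split()))
    for w1, w2 in combinations(tokens, 2):
        weights[(w1, w2)] += 1

g = nx.Graph()
for (w1, w2), w in weights.items():
    g.add_edge(w1, w2, weight=w)

# PageRank over the co-occurrence graph surfaces the most central terms.
scores = nx.pagerank(g, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```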

While on the subject of books, I also reviewed a few books for Packt and O'Reilly. Some of the notable ones are listed below; you can find my reviews for them on Amazon.

I also presented a couple of half-day tutorials at the Open Data Science Conference (ODSC), the first at ODSC West in 2021 and the second at ODSC Global in 2022. Both were delivered online and were similar to the session I had presented in 2020. ODSC does not provide videos of the tutorial sessions, but both were very hands-on and involved no slides, just me working through a bunch of Colab notebooks with the attendees; the notebooks are available at these Github repositories.

I also did a couple of external presentations, both based on my work on fine-tuning OpenAI's CLIP model that I described in my previous blog post from almost one and a half years ago.

One of the spin-offs of my ODSC tutorial was some work I did exploring neural relation extraction models (Github repository). In it, I build six different neural models and train and evaluate them against the SciERC dataset, predicting one of 8 relation types given a sentence with a pair of named entities.
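For a rough idea of what one of these models looks like, here is a hedged sketch of a common neural relation extraction setup: wrap the two entities in marker tokens and fine-tune a transformer encoder as an 8-way sentence classifier. The base model and marker scheme below are assumptions for illustration, not necessarily what the repository uses.

```python
# Sketch of entity-marker relation classification: the two entity spans are
# wrapped in special tokens and the model scores one of 8 relation types.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption; SciERC work often uses a scientific-text encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=8)
model.resize_token_embeddings(len(tokenizer))

# A SciERC-style example: entity spans are wrapped with the marker tokens.
text = "We evaluate [E1] LSTM encoders [/E1] on the [E2] SemEval dataset [/E2] ."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 8), one score per relation type
print(logits.argmax(dim=-1))  # predicted relation id (untrained here, so arbitrary)
```

Fine-tuning on the labeled SciERC pairs is the part that does the real work; the snippet only shows the forward pass and input encoding.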

Another interesting project a Data Scientist colleague and I did as part of our Dev10 project last year was to combine automated question generation with an FAQ style question answering pipeline to produce a question answering system. FAQ style QA systems generally have a corpus of question-answer pairs; they match the incoming question to the most similar stored question and return the corresponding answer -- think Quora. Of course, Quora has managed to harness domain experts at web scale, but most FAQ based systems depend on expensive human domain experts painstakingly answering individual questions. We used an off-the-shelf Large Language Model (LLM) tuned for English question generation to generate questions against medical passages, and used the resulting (question, passage) pairs as our FAQ corpus. Empirically, we found our results to be quite encouraging.
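The snippet below sketches the question generation step. The checkpoint name and its input format are assumptions (there are several T5-based question generation models on the HuggingFace Hub); what we relied on was the general pattern of feeding a passage to a seq2seq model and harvesting the generated questions as FAQ entries.

```python
# Sketch of generating FAQ-style questions from a passage with an off-the-shelf
# seq2seq question generation model. The checkpoint name and its expected input
# format are assumptions -- check the model card of whichever QG model you use.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")  # assumed checkpoint

passage = (
    "Aspirin is used to reduce fever and relieve mild to moderate pain "
    "from conditions such as muscle aches and headaches."
)
# Many end-to-end QG checkpoints return several questions separated by <sep>.
output = qg("generate questions: " + passage, max_length=64)[0]["generated_text"]
questions = [q.strip() for q in output.split("<sep>") if q.strip()]

# Each (question, passage) pair becomes an FAQ entry, with the passage
# serving as the "answer" side of the pair.
faq = [{"question": q, "answer": passage} for q in questions]
print(faq)
```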

To build the demo, I used the Haystack library from Deepset, and I highly recommend it. My demo used three LLMs in the pipeline: one for Question Generation during indexing, one for computing question similarity, and one for reader-retriever style question answering during search. It also had a traditional BM25 interface to the Elasticsearch instance containing the (passage, question) pairs and vectorized questions. Overall it was a fairly complex pipeline, but Haystack made it super easy to implement.
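Haystack's API has shifted across versions, so treat the following as a minimal sketch of just the FAQ-matching leg of such a pipeline, assuming Haystack 1.x, a local Elasticsearch instance, and placeholder model names -- not the actual demo code.

```python
# Minimal FAQ-matching sketch, assuming Haystack 1.x and a running Elasticsearch.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import FAQPipeline

# A couple of (question, answer) pairs; in the demo these came from the
# question generation step over medical passages.
faq = [
    {"question": "What is aspirin used for?",
     "answer": "Aspirin is used to reduce fever and relieve mild to moderate pain."},
]

document_store = ElasticsearchDocumentStore(
    host="localhost", index="faq", embedding_dim=384, similarity="cosine"
)
# The generated question is the document content; its passage rides along
# in metadata as the answer.
document_store.write_documents(
    [{"content": p["question"], "meta": {"answer": p["answer"]}} for p in faq]
)

# Vectorize the stored questions so incoming queries can be matched by similarity.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # placeholder model
)
document_store.update_embeddings(retriever)

pipeline = FAQPipeline(retriever=retriever)
result = pipeline.run(query="What is aspirin good for?", params={"Retriever": {"top_k": 3}})
print(result["answers"][0].answer)
```

The BM25 and reader-retriever legs plug into the same document store in much the same way, which is what made the overall pipeline easy to assemble.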

At work, I built a proof of concept (an elaboration, in Agile speak) for an infrastructure component revolving around dictionary based NER, with additional text annotation services such as negation detection and term disambiguation in the pipeline. I am helping the team move the POC into production, and simultaneously working to add multi-language support to it, beginning with French. Although it is mostly traditional NLP, there are some areas where we have applied ML models to good effect.
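To illustrate the dictionary lookup idea (and only the idea -- this is not the actual component, and the dictionary and negation cues below are made up), here is a toy sketch using spaCy's PhraseMatcher with a crude negation cue check.

```python
# Toy sketch of dictionary based NER plus a naive negation check using spaCy's
# PhraseMatcher. Dictionary terms and negation cues are invented for illustration.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline needed for lookup

# The "dictionary": surface forms mapped to a single label (here, DISEASE).
terms = ["diabetes mellitus", "hypertension", "myocardial infarction"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DISEASE", [nlp.make_doc(t) for t in terms])

NEGATION_CUES = {"no", "denies", "without", "negative"}

doc = nlp("Patient denies hypertension but has a history of diabetes mellitus.")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Very naive negation detection: look for a cue word shortly before the match.
    window = doc[max(0, start - 3):start]
    negated = any(tok.lower_ in NEGATION_CUES for tok in window)
    print(span.text, "NEGATED" if negated else "AFFIRMED")
```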

So that's basically all I can think of. In retrospect, it does look like I had stuff I could have written about. I also have some ideas about what I want to write about going forward, and while I don't think I can maintain a weekly cadence like in earlier years, a monthly cadence is definitely doable. Hence the decision to resurrect the blog; I am excited to see where it goes.
