Salmon Run
Swimming upstream on the technology tide, one technology at a time. A collection of articles, tips, and random musings on application development and system design.
By Sujit Pal (http://www.blogger.com/profile/06835223352394332155)

Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment (2024-03-17)
At our weekly This Week in Machine Learning (TWIML) meetings, our leader and facilitator Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use the number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them ...

Thoughts on using LangChain LCEL with Claude (2024-02-24)
I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. This led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, and use the search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results.
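That simple query-retrieve-summarize flow can be sketched in a few lines. Everything below is a toy stand-in of my own (a word-overlap retriever over a tiny in-memory corpus, and an echoing `generate` function where a real LLM call would go), shown only to make the shape of the pipeline concrete:

```python
def search(query, k=3):
    """Toy retriever: rank a tiny corpus by word overlap with the query."""
    corpus = [
        "RAG combines retrieval with generation.",
        "LLMs can summarize retrieved context.",
        "Semantic search uses embeddings.",
    ]
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate(prompt):
    # A real LLM call would go here; we just echo for illustration.
    return "Summary of: " + prompt[:40]

def rag(query):
    # Retrieve, stuff results into the prompt as context, then generate.
    context = "\n".join(search(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The essential point is that the LLM never answers from its own memory alone; the retriever decides what evidence reaches the prompt.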
Book Report: Allen B Downey's Probably Overthinking It (2024-02-03)
I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I hasten to add). Last year, I had the good fortune to present at PyData Global ...

Knowledge Graph Aligned Entity Linker using SentenceTransformers (2024-01-01)
Most of us are familiar with Named Entity Recognizers (NERs) that can recognize spans in text as belonging to a small number of classes, such as Person (PER), Organization (ORG), Location (LOC), etc. These are usually multi-class classifier models, trained on input sequences to return BIO (Begin-Inside-Outside) tags for each token. However, recognizing entities in a Knowledge Graph (KG) using this ...

PyData Global 2023: Trip Report (2023-12-09)
I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over three days in multiple tracks, from December 6 to 8. I talked about building Learning to Rank models for search using Large Language Models.
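Stepping back to the BIO tagging scheme mentioned in the entity-linker post above: it is easy to illustrate with a small helper (the function and its interface are mine, not from the post) that converts token-level entity spans into per-token tags:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) spans to per-token BIO tags.

    Spans use token indices with an exclusive end: B- marks the first
    token of an entity, I- the continuation, O everything else.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags
```

For example, `spans_to_bio(["Barack", "Obama", "visited", "Paris"], [(0, 2, "PER"), (3, 4, "LOC")])` yields `["B-PER", "I-PER", "O", "B-LOC"]` -- the per-token targets a multi-class NER classifier is trained to predict.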
For those attending the conference, I already shared the links to the slides and the associated code on its Discord channel; for those who are not, they are ...

Building Learning to Rank Models with Generative AI (2023-12-03)
Generative AI has been the new cool kid on the AI/ML block since early this year. Like everyone else, I continue to be amazed and wowed by each successive success story as models break existing benchmark records and showcase novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access ...

A PySpark idiom for efficient Model Inference (2023-10-07)
I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the text-to-encoding mapping is one to one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N ...

BMI 702 Review Part IV -- Biomedical Imaging (2023-06-24)
Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below.
BMI 702 Review Part I
BMI 702 Review Part II (Graph Learning)
BMI 702 Review Part III (Language Modeling)

Future of Data Centric AI -- Trip Report (2023-06-09)
I attended the Future of Data Centric AI 2023 this week, a free virtual conference organized by Snorkel AI. Snorkel.AI is a company built around the open-source Snorkel framework for programmatic data labeling. The project originally started at Stanford University's Hazy Research group, and many (all?) of the company's founders and some engineers are from the original research team. Snorkel.AI ...

BMI 702 Review Part III (Language Modeling) (2023-05-21)
Welcome to Part III of my review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous two reviews in this series, they are listed below.
BMI 702 Review Part I
BMI 702 Review Part II (Graph Learning)
As the title of my post ...

Haystack US 2023: Trip Report (2023-04-29)
I attended the Haystack US 2023 Search Relevance conference last week. It was a great opportunity to share ideas and techniques around search and search relevance, to catch up with old friends and acquaintances, and to make new ones. I was there only for the two days of the actual conference, but there were events before and after the conference as well. The full talk schedule ...

BMI 702 Review Part II (Graph Learning) (2023-04-22)
This week I continue with the review of the papers suggested in the Biomedical Artificial Intelligence (BMI 702) course, specifically the Graph Learning (M3) module. There are 7 papers in the first week (2 required, 5 optional) and 5 in the second week (2 required, 3 optional). In this post I will attempt to enumerate my high-level takeaways from this module and summarize these 12 papers so you can ...

BMI 702 Review Part I (2023-03-25)
I recently moved to our Health Markets division as part of an internal restructuring. While it is essentially a lateral shift, there are subtle differences in the kind of work I will do going forward versus what I have been doing at Elsevier so far.
At my previous position at Labs, the focus of the work was more on the use of technology to solve business problems of other teams, such as those in ...

Resurrection (2023-02-18)
2022 has come and gone without a single blog post from my end. To be fair, my blogging output has been steadily decreasing over the last few years, so you would be justified in thinking of it as a somewhat inevitable trend. In other words, we had a good run, etc. Thinking back, one possible reason for my decreasing output is that my previous job was more product focused and my current one is ...

Fine-tuning OpenAI CLIP for different domains (2021-10-17)
In July this year, a group of us on the TWIML Slack Channel came together and participated in the Flax/JAX Community Week organized by Hugging Face and Google Cloud. Our project was about fine-tuning the CLIP model from OpenAI with RSICD (Remote Sensing Image Captioning Dataset), and we ended up placing third.
The code for the project is available on GitHub at arampacha/CLIP-rsicd if you are ...

Distributed Training of a Bengali ALBERT model (2021-05-21)
Even though I am from India, my mother tongue is Bengali, and I speak, read, and write both Hindi and Bengali almost as well as English, in my career in Natural Language Processing (NLP) I have worked exclusively with English. This is probably not that uncommon, because until recently English was the language where most NLP work happened, and to a lesser extent some of the major European ...

More tricks to improve performance of CIFAR-10 classifier (2021-03-27)
Some time back I wrote a post about tricks to improve the performance of a CIFAR-10 classifier, based on things I learned from New York University's Deep Learning with PyTorch course taught by Yann LeCun and Alfredo Canziani. The tricks I covered were conveniently located on a single slide in one of the lectures. Shortly thereafter, I learned of a few more tricks that were mentioned in passing, so I ...

Learning Vespa (2021-02-28)
No, not the scooter :-).
I meant Vespa.AI, a search engine that supports structured search, text search, and approximate vector search. While Vespa's vector search functionality was probably built in response to search engines incorporating vector-based signals into their ranking algorithms, there are many ML/NLP pipelines as well that can benefit from vector search, i.e., the ability to find ...

Comparison of Text Augmentation Strategies for Spam Detection (2021-02-07)
Some time back, I found myself thinking of different data augmentation strategies for unbalanced datasets, i.e., datasets in which one or more classes are over-represented compared to the others, and wondering how these strategies stack up against one another. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.
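As one concrete example of such a strategy (my own illustration -- not necessarily one of the strategies the post compares), random word deletion synthesizes extra minority-class samples by dropping words with some small probability:

```python
import random

def random_deletion(text, p=0.2, seed=None):
    """Drop each word with probability p; always keep at least one word."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)

def augment_minority(texts, n_copies=2):
    """Oversample a minority class by adding n_copies noisy variants per text."""
    out = list(texts)
    for t in texts:
        out.extend(random_deletion(t, seed=i) for i in range(n_copies))
    return out
```

The appeal of perturbation-style augmentation is that it needs no external resources; its risk is that too aggressive a deletion rate can destroy the very signal (e.g., spammy keywords) the classifier needs.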
The dataset I chose for this ...

Tricks to improve performance of CIFAR-10 classifier (2021-01-19)
I am participating in a weekly meetup with a TWIML (This Week in Machine Learning) group where we go through video lectures of the NYU (New York University) course Deep Learning (with PyTorch). Each week we cover one of the lectures in an "inverted classroom" manner -- we watch the video ourselves before attending, and one person leads the discussion, covering the main points of the lecture and ...

First steps with Pytorch Lightning (2020-12-19)
Some time back, Quora routed a "Keras vs. PyTorch" question to me, which I decided to ignore because it seemed too much like flamebait. A couple of weeks back, after discussions with colleagues and (professional) acquaintances who had tried out libraries like Catalyst, Ignite, and Lightning, I decided to get on the PyTorch boilerplate-elimination train as well, and tried out PyTorch Lightning ...

Word Sense Disambiguation using BERT as a Language Model (2020-11-30)
The BERT (Bidirectional Encoder Representations from Transformers) model was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019). BERT is the encoder part of the encoder-decoder Transformer architecture proposed in Attention Is All You Need (Vaswani et al., 2017).
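The core idea in the word-sense post can be miniaturized: score each candidate sense against the sentence and keep the best. In this sketch a toy word-overlap scorer (with made-up sense signatures) stands in for BERT's language-model score, purely to show the sense-selection loop:

```python
# Invented sense inventories for the ambiguous word "bank"; in the real
# post, BERT's language-model score replaces this overlap heuristic.
SENSE_SIGNATURES = {
    "bank/finance": {"money", "loan", "deposit", "account"},
    "bank/river":   {"water", "shore", "fishing", "mud"},
}

def disambiguate(sentence, target="bank"):
    """Pick the sense whose signature best matches the sentence context."""
    context = set(sentence.lower().split()) - {target}
    def score(sense):
        return len(context & SENSE_SIGNATURES[sense])
    return max(SENSE_SIGNATURES, key=score)
```

Swapping the overlap score for a masked-LM probability keeps the same loop but lets the scorer see word order and syntax, which is exactly what a bag-of-words heuristic cannot.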
The BERT model is pre-trained on two tasks ...

ODSC trip report and Keras Tutorial (2020-11-14)
I attended ODSC (Open Data Science Conference) West 2020 at the end of last month. I also presented "Keras from Soup to Nuts -- an example driven tutorial" there, a 3-hour tutorial on Keras. Like other conferences this year, the event was all virtual. Having attended one other all-virtual conference this year (Knowledge Discovery and Data Mining (KDD) 2020) and being part of organizing another (an ...

Entities from CORD-19 using Dask, SciSpaCy, and Saturn Cloud (2020-10-31)
It's been a while since I last posted here, but I recently posted on our Elsevier Labs blog, and I wanted to point folks to that. The post, titled "How Elsevier Accelerated COVID-19 research using Dask and Saturn Cloud", describes some work I did to extract biomedical entities from the CORD-19 dataset using Dask and trained Named Entity Recognition (NER) and ...

Disambiguating SciSpacy + UMLS entities using the Viterbi algorithm (2020-08-08)
The SciSpacy project from AllenAI provides a language model trained on biomedical text, which can be used for Named Entity Recognition (NER) of biomedical entities through the standard SpaCy API. Unlike the entities found using SpaCy's language models (at least the English one), where entities have types such as PER, GEO, ORG, etc., SciSpacy entities have the single type ENTITY.
In order to further ...
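The disambiguation idea named in that last title can be sketched generically. This Viterbi helper (the interface is invented here, not SciSpacy's or UMLS's) picks one candidate concept per entity mention so that per-mention match scores and pairwise coherence between consecutive choices are jointly maximized:

```python
def viterbi_link(candidates, emit, trans):
    """Pick one concept per mention, maximizing emission * transition scores.

    candidates: list (one entry per mention) of lists of candidate concept IDs
    emit(c):    score of candidate c for its own mention
    trans(a,b): coherence score for choosing b right after a
    """
    # best[c] = (score of best path ending in c, that path)
    best = {c: (emit(c), [c]) for c in candidates[0]}
    for step in candidates[1:]:
        new_best = {}
        for c in step:
            # Extend the best-scoring predecessor path with candidate c.
            prev, (score, path) = max(
                best.items(), key=lambda kv: kv[1][0] * trans(kv[0], c))
            new_best[c] = (score * trans(prev, c) * emit(c), path + [c])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]
```

Dynamic programming keeps this linear in the number of mentions (times the square of the candidates per mention), instead of enumerating every combination of candidate concepts.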