Salmon Run
Swimming upstream on the technology tide, one technology at a time. A collection of articles, tips, and random musings on application development and system design.
By Sujit Pal (http://www.blogger.com/profile/06835223352394332155)

Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment (2024-03-17)
At our weekly This Week in Machine Learning (TWIML) meetings, our leader and facilitator Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use the number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them ...

Thoughts on using LangChain LCEL with Claude (2024-02-24)
I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. This led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, and use the search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results.
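That simple query-retrieve-summarize flow can be sketched in a few lines. Everything below is a toy stand-in of my own (a word-overlap retriever over a tiny in-memory corpus, and an echoing `generate` function where a real LLM call would go), shown only to make the shape of the pipeline concrete:

```python
def search(query, k=3):
    """Toy retriever: rank a tiny corpus by word overlap with the query."""
    corpus = [
        "RAG combines retrieval with generation.",
        "LLMs can summarize retrieved context.",
        "Semantic search uses embeddings.",
    ]
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate(prompt):
    # A real LLM call would go here; we just echo for illustration.
    return "Summary of: " + prompt[:40]

def rag(query):
    # Retrieve, stuff results into the prompt as context, then generate.
    context = "\n".join(search(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The essential point is that the LLM never answers from its own memory alone; the retriever decides what evidence reaches the prompt.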
Book Report: Allen B Downey's Probably Overthinking It (2024-02-03)
I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I hasten to add). Last year, I had the good fortune to present at PyData Global ...

Knowledge Graph Aligned Entity Linker using SentenceTransformers (2024-01-01)
Most of us are familiar with Named Entity Recognizers (NERs) that can recognize spans in text as belonging to a small number of classes, such as Person (PER), Organization (ORG), Location (LOC), etc. These are usually multi-class classifier models, trained on input sequences to return BIO (Begin-Inside-Outside) tags for each token. However, recognizing entities in a Knowledge Graph (KG) using this ...

PyData Global 2023: Trip Report (2023-12-09)
I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over three days in multiple tracks, from December 6 to 8. I talked about building Learning to Rank models for search using Large Language Models.
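Stepping back to the BIO tagging scheme mentioned in the entity-linker post above: it is easy to illustrate with a small helper (the function and its interface are mine, not from the post) that converts token-level entity spans into per-token tags:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) spans to per-token BIO tags.

    Spans use token indices with an exclusive end: B- marks the first
    token of an entity, I- the continuation, O everything else.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags
```

For example, `spans_to_bio(["Barack", "Obama", "visited", "Paris"], [(0, 2, "PER"), (3, 4, "LOC")])` yields `["B-PER", "I-PER", "O", "B-LOC"]` -- the per-token targets a multi-class NER classifier is trained to predict.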
For those attending the conference, I already shared the links to the slides and the associated code on its Discord channel; for those who are not, they are ...

Building Learning to Rank Models with Generative AI (2023-12-03)
Generative AI has been the new cool kid on the AI/ML block since early this year. Like everyone else, I continue to be amazed and wowed by each successive success story as models break existing benchmark records and showcase novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access ...

A PySpark idiom for efficient Model Inference (2023-10-07)
I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the text-to-encoding mapping is one to one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N ...

BMI 702 Review Part IV -- Biomedical Imaging (2023-06-24)
Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below.
BMI 702 Review Part I
BMI 702 Review Part II (Graph Learning)
BMI 702 Review Part III (Language Modeling)

Future of Data Centric AI -- Trip Report (2023-06-09)
I attended the Future of Data Centric AI 2023 this week, a free virtual conference organized by Snorkel AI. Snorkel.AI is a company built around the open-source Snorkel framework for programmatic data labeling. The project originally started at Stanford University's Hazy Research group, and many (all?) of the company's founders and some engineers are from the original research team. Snorkel.AI ...

BMI 702 Review Part III (Language Modeling) (2023-05-21)
Welcome to Part III of my review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundations of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous two reviews in this series, they are listed below.
BMI 702 Review Part I
BMI 702 Review Part II (Graph Learning)
As the title of my post ...

Haystack US 2023: Trip Report (2023-04-29)
I attended the Haystack US 2023 Search Relevance conference last week. It was a great opportunity to share ideas and techniques around search and search relevance, to catch up with old friends and acquaintances, and to make new ones. I was there only for the two days of the actual conference, but there were events before and after the conference as well. The full talk schedule ...

BMI 702 Review Part II (Graph Learning) (2023-04-22)
This week I continue with the review of the papers suggested in the Biomedical Artificial Intelligence (BMI 702) course, specifically the Graph Learning (M3) module. There are 7 papers in the first week (2 required, 5 optional) and 5 in the second week (2 required, 3 optional). In this post I will attempt to enumerate my high-level takeaways from this module and summarize these 12 papers so you can ...

BMI 702 Review Part I (2023-03-25)
I recently moved to our Health Markets division as part of an internal restructuring. While it is essentially a lateral shift, there are subtle differences in the kind of work I will do going forward versus what I have been doing at Elsevier so far.
At my previous position at Labs, the focus of the work was more on the use of technology to solve business problems of other teams, such as those in ...

Resurrection (2023-02-18)
2022 has come and gone without a single blog post from my end. To be fair, my blogging output has been steadily decreasing over the last few years, so you would be justified in thinking of it as a somewhat inevitable trend. In other words, we had a good run, etc. Thinking back, one possible reason for my decreasing output is that my previous job was more product focused and my current one is ...

Fine-tuning OpenAI CLIP for different domains (2021-10-17)
In July this year, a group of us on the TWIML Slack Channel came together and participated in the Flax/JAX Community Week organized by Hugging Face and Google Cloud. Our project was about fine-tuning the CLIP model from OpenAI with RSICD (Remote Sensing Image Captioning Dataset), and we ended up placing third.
The code for the project is available on GitHub at arampacha/CLIP-rsicd if you are ...

Distributed Training of a Bengali ALBERT model (2021-05-21)
Even though I am from India, my mother tongue is Bengali, and I speak, read, and write both Hindi and Bengali almost as well as English, in my career in Natural Language Processing (NLP) I have worked exclusively with English. This is probably not that uncommon, because until recently English was the language where most NLP work happened, and to a lesser extent some of the major European ...

More tricks to improve performance of CIFAR-10 classifier (2021-03-27)
Some time back I wrote a post about tricks to improve the performance of a CIFAR-10 classifier, based on things I learned from New York University's Deep Learning with PyTorch course taught by Yann LeCun and Alfredo Canziani. The tricks I covered were conveniently located on a single slide in one of the lectures. Shortly thereafter, I learned of a few more tricks that were mentioned in passing, so I ...

Learning Vespa (2021-02-28)
No, not the scooter :-).
I meant Vespa.AI, a search engine that supports structured search, text search, and approximate vector search. While Vespa's vector search functionality was probably built in response to search engines incorporating vector-based signals into their ranking algorithms, there are many ML/NLP pipelines as well that can benefit from vector search, i.e., the ability to find ...

Comparison of Text Augmentation Strategies for Spam Detection (2021-02-07)
Some time back, I found myself thinking of different data augmentation strategies for unbalanced datasets, i.e., datasets in which one or more classes are over-represented compared to the others, and wondering how these strategies stack up against one another. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.
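As one concrete example of such a strategy (my own illustration -- not necessarily one of the strategies the post compares), random word deletion synthesizes extra minority-class samples by dropping words with some small probability:

```python
import random

def random_deletion(text, p=0.2, seed=None):
    """Drop each word with probability p; always keep at least one word."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)

def augment_minority(texts, n_copies=2):
    """Oversample a minority class by adding n_copies noisy variants per text."""
    out = list(texts)
    for t in texts:
        out.extend(random_deletion(t, seed=i) for i in range(n_copies))
    return out
```

The appeal of perturbation-style augmentation is that it needs no external resources; its risk is that too aggressive a deletion rate can destroy the very signal (e.g., spammy keywords) the classifier needs.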
The dataset I chose for this ...

Tricks to improve performance of CIFAR-10 classifier (2021-01-19)
I am participating in a weekly meetup with a TWIML (This Week in Machine Learning) group where we go through video lectures of the NYU (New York University) course Deep Learning (with PyTorch). Each week we cover one of the lectures in an "inverted classroom" manner -- we watch the video ourselves before attending, and one person leads the discussion, covering the main points of the lecture and ...

First steps with Pytorch Lightning (2020-12-19)
Some time back, Quora routed a "Keras vs. PyTorch" question to me, which I decided to ignore because it seemed too much like flamebait. A couple of weeks back, after discussions with colleagues and (professional) acquaintances who had tried out libraries like Catalyst, Ignite, and Lightning, I decided to get on the PyTorch boilerplate-elimination train as well, and tried out PyTorch Lightning ...

Word Sense Disambiguation using BERT as a Language Model (2020-11-30)
The BERT (Bidirectional Encoder Representations from Transformers) model was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019). BERT is the encoder part of the encoder-decoder Transformer architecture proposed in Attention Is All You Need (Vaswani et al., 2017).
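The core idea in the word-sense post can be miniaturized: score each candidate sense against the sentence and keep the best. In this sketch a toy word-overlap scorer (with made-up sense signatures) stands in for BERT's language-model score, purely to show the sense-selection loop:

```python
# Invented sense inventories for the ambiguous word "bank"; in the real
# post, BERT's language-model score replaces this overlap heuristic.
SENSE_SIGNATURES = {
    "bank/finance": {"money", "loan", "deposit", "account"},
    "bank/river":   {"water", "shore", "fishing", "mud"},
}

def disambiguate(sentence, target="bank"):
    """Pick the sense whose signature best matches the sentence context."""
    context = set(sentence.lower().split()) - {target}
    def score(sense):
        return len(context & SENSE_SIGNATURES[sense])
    return max(SENSE_SIGNATURES, key=score)
```

Swapping the overlap score for a masked-LM probability keeps the same loop but lets the scorer see word order and syntax, which is exactly what a bag-of-words heuristic cannot.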
The BERT model is pre-trained on two tasks ...

ODSC trip report and Keras Tutorial (2020-11-14)
I attended ODSC (Open Data Science Conference) West 2020 at the end of last month. I also presented "Keras from Soup to Nuts -- an example driven tutorial" there, a 3-hour tutorial on Keras. Like other conferences this year, the event was all virtual. Having attended one other all-virtual conference this year (Knowledge Discovery and Data Mining (KDD) 2020) and being part of organizing another (an ...

Entities from CORD-19 using Dask, SciSpaCy, and Saturn Cloud (2020-10-31)
It's been a while since I last posted here, but I recently posted on our Elsevier Labs blog, and I wanted to point folks to that. The post, titled "How Elsevier Accelerated COVID-19 research using Dask and Saturn Cloud", describes some work I did to extract biomedical entities from the CORD-19 dataset using Dask and trained Named Entity Recognition (NER) and ...

Disambiguating SciSpacy + UMLS entities using the Viterbi algorithm (2020-08-08)
The SciSpacy project from AllenAI provides a language model trained on biomedical text, which can be used for Named Entity Recognition (NER) of biomedical entities through the standard SpaCy API. Unlike the entities found using SpaCy's language models (at least the English one), where entities have types such as PER, GEO, ORG, etc., SciSpacy entities have the single type ENTITY.
In order to further ...
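The disambiguation idea named in that last title can be sketched generically. This Viterbi helper (the interface is invented here, not SciSpacy's or UMLS's) picks one candidate concept per entity mention so that per-mention match scores and pairwise coherence between consecutive choices are jointly maximized:

```python
def viterbi_link(candidates, emit, trans):
    """Pick one concept per mention, maximizing emission * transition scores.

    candidates: list (one entry per mention) of lists of candidate concept IDs
    emit(c):    score of candidate c for its own mention
    trans(a,b): coherence score for choosing b right after a
    """
    # best[c] = (score of best path ending in c, that path)
    best = {c: (emit(c), [c]) for c in candidates[0]}
    for step in candidates[1:]:
        new_best = {}
        for c in step:
            # Extend the best-scoring predecessor path with candidate c.
            prev, (score, path) = max(
                best.items(), key=lambda kv: kv[1][0] * trans(kv[0], c))
            new_best[c] = (score * trans(prev, c) * emit(c), path + [c])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]
```

Dynamic programming keeps this linear in the number of mentions (times the square of the candidates per mention), instead of enumerating every combination of candidate concepts.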