Providing the context for you...
If you’re here, I’m assuming you are aware of the hype created around LLMs over the past few months. Out of the box, though, they are in no position to deliver responses tailored to a specific consumer unless made to do so.
In the health sector, grounding and interpreting domain-specific, non-linguistic data is important; generalized information won’t be of much use to the patient. So we either need to fine-tune the LLM on the desired datasets or use prompting techniques to bring out the required response.
What is the paper about?
The paper investigates the capacity of LLMs to deliver multimodal health predictions based on contextual information such as user demographics, health knowledge, and physiological data like resting heart rate, sleep minutes, etc.
It evaluates eight LLMs with diverse prompting and fine-tuning techniques on six public health datasets (PMData, LifeSnaps, GLOBEM, AW_FB, MIT-BIH, and MIMIC-III).

Their experiments cover thirteen consumer health prediction tasks in mental health, activity, metabolic, sleep, and cardiac assessment.
Their fine-tuned model, Health-Alpaca, with its context enhanced, yields up to a 23.8% improvement in performance, exhibiting performance comparable to much larger models (GPT-3.5 and GPT-4) and achieving the best results in 5 out of 13 tasks.
Motivation for the paper...
The performance of large language models in text generation and knowledge retrieval does present a lot of opportunities. But in sensitive domains like healthcare and finance, limitations exist, especially when it comes to harnessing the diverse collection of multi-modal, time-series data generated by wearable sensors.
This kind of data presents unique challenges for LLMs due to its high dimensionality, non-linear relationships, and continuous nature, requiring them to understand not only individual data points but also their dynamic patterns over time.
Some specialized medical domain LLMs have shown positive results, but when it comes to consumer health tasks relying heavily on physiological and behavioral time-series data, the challenges of grounding LLMs in non-linguistic data and the lack of standardized evaluation benchmarks become evident.
So, this paper proposes Health-LLM, a framework in the healthcare domain that aims to bridge the gap between pre-trained knowledge in current LLMs and consumer health problems.
METHODS USED…

1) ZERO-SHOT PROMPTING
It is the process of context enhancement using prompt templates: the model receives no examples, only a query enriched with context (a minimal sketch follows the list below). Four settings are tested:
1) User context provides user-specific information such as age, gender, weight, and height, which shapes how the health data should be interpreted.
2) Health context provides the definitions and equations that govern certain health targets, injecting new health knowledge into the LLM.
3) Temporal context is adopted to test the importance of temporal aspects in time-series data: instead of aggregated statistics, the raw time-series sequence is used.
4) ‘All’ is the case where all of the above contexts are combined in the prompt.
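To make this concrete, here is a minimal sketch of how such a context-enhanced zero-shot prompt could be assembled. The template wording, the field names, and the `build_prompt` helper are my own illustration under assumed data, not the paper’s exact templates.

```python
# Illustrative context-enhanced zero-shot prompt builder (hypothetical
# template, not the paper's exact wording).

def build_prompt(user_ctx=None, health_ctx=None, steps_series=None):
    """Assemble a zero-shot prompt; any context block may be omitted."""
    parts = []
    if user_ctx:  # user context: demographics that shape interpretation
        parts.append(f"User profile: {user_ctx}")
    if health_ctx:  # health context: definitions/equations for the target
        parts.append(f"Health knowledge: {health_ctx}")
    if steps_series:  # temporal context: raw sequence, not aggregates
        parts.append("Steps over the last 7 days: "
                     + ", ".join(str(s) for s in steps_series))
    parts.append("Question: Estimate this user's stress level on a 1-5 "
                 "scale. Answer with a single number.")
    return "\n".join(parts)

# The 'All' setting corresponds to passing every context at once.
print(build_prompt(
    user_ctx="age 31, female, 165 cm, 60 kg",
    health_ctx="Stress tends to rise with short sleep and an elevated "
               "resting heart rate.",
    steps_series=[8123, 4560, 10233, 3200, 7800, 6400, 9100],
))
```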
2) FEW-SHOT PROMPTING
This approach provides the model with a handful of worked examples to help it grasp and apply healthcare domain knowledge effectively. The prompting strategy is further enriched by integrating Chain-of-Thought and Self-Consistency techniques: the model reasons step by step, and multiple sampled answers are aggregated by majority vote.
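Below is a minimal sketch of few-shot prompting combined with self-consistency: the prompt carries a couple of worked examples, the model is sampled several times at non-zero temperature, and the majority answer wins. The worked examples and the `query_llm` callable are hypothetical stand-ins, not the paper’s actual prompts or API.

```python
import re
from collections import Counter

# Hypothetical few-shot examples with chain-of-thought reasoning.
FEW_SHOT_EXAMPLES = """\
Example 1:
Data: sleep 7.5 h, resting HR 58 bpm, 11,000 steps.
Reasoning: good sleep, high activity, low resting HR -> low stress.
Answer: 1

Example 2:
Data: sleep 4.2 h, resting HR 74 bpm, 2,100 steps.
Reasoning: short sleep and elevated HR -> high stress.
Answer: 4
"""

def self_consistent_predict(query_llm, case, n_samples=5):
    """Sample several reasoning paths and return the majority answer.
    query_llm(prompt, temperature) stands in for any LLM call."""
    prompt = (FEW_SHOT_EXAMPLES
              + f"\nNow the new case:\nData: {case}\n"
              + "Reasoning: think step by step, then end with "
              + "'Answer: <1-5>'.")
    answers = []
    for _ in range(n_samples):
        text = query_llm(prompt, temperature=0.7)  # diverse reasoning paths
        match = re.search(r"Answer:\s*([1-5])", text)
        if match:
            answers.append(match.group(1))
    return Counter(answers).most_common(1)[0][0] if answers else None
```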
3) INSTRUCTION TUNING
It is a technique where all parameters of a pre-trained model are further trained or fine-tuned for a target task. This process allows the model to adapt its pre-trained knowledge to the specificities of the new task, optimizing its performance.
In the context of health prediction, fine-tuning allows the model to deeply understand physiological terminologies, mechanisms, and context, thereby enhancing its ability to generate accurate and contextually relevant responses.
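As a rough illustration, full-parameter instruction tuning with Hugging Face Transformers might look like the sketch below. The base checkpoint, data file, and hyperparameters are assumptions for the example, not the paper’s training recipe.

```python
# Sketch of full-parameter instruction tuning; the checkpoint, data
# file, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without one
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # "instruction" = context-enhanced prompt, "output" = target answer
    text = example["instruction"] + "\n" + example["output"]
    return tokenizer(text, truncation=True, max_length=512)

data = load_dataset("json", data_files="health_instructions.json")["train"]
tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="health-alpaca-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    # mlm=False yields causal-LM labels (predict the next token)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```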
4) PARAMETER EFFICIENT FINE-TUNING
Methods like LoRA involve training a small proportion of parameters by injecting trainable low-rank matrices into each layer of the pre-trained model.
In the Health-LLM context, these techniques enable the model to adapt to healthcare tasks while maintaining computational efficiency.
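A minimal sketch with the `peft` library, which implements LoRA on top of Transformers, is shown below; the rank and target modules are common defaults I am assuming, not values taken from the paper.

```python
# Sketch of LoRA: freeze the base model and train only small low-rank
# adapter matrices injected into the attention projections.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed base
config = LoraConfig(
    r=8,                                  # rank of the injected matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # projections receiving adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

The wrapped model can then be trained exactly like the full fine-tuning sketch above, with gradients flowing only through the adapter weights.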

MODELS USED…
MED-ALPACA
An advanced LLM, fine-tuned specifically for medical question-answering. Built upon the foundations of Alpaca, it utilizes a diverse array of medical texts.
CLINICAL CAMEL
An open LLM, fine-tuned on the LLaMA-2 70B architecture using QLoRA.
FLAN-T5
An instruction fine-tuned version of the text-to-text transfer transformer language model.
PALMYRA-MED
An LLM fine-tuned on a custom medical dataset, demonstrating strong performance on medical benchmarks such as PubMedQA & MedQA.
PMC-LLAMA
An open-source LLM trained on 4.8 million biomedical papers & 30k medical textbooks.
GPT-3.5
Specifically fine-tuned to provide direct answers or complete text rather than simulating conversations.
GPT-4
Exhibits remarkable capabilities in various NLP tasks, including translation, question answering, and text generation without task-specific fine-tuning.
RESULTS
The most anticipated part for so many researchers and health professionals.

ZERO-SHOT & FEW-SHOT PERFORMANCE
Zero-shot prompting shows performance comparable to the best task-specific baselines, indicating that LLMs already have a promising capability for health prediction tasks based on wearable data.
Moreover, the significant improvements GPT-3.5 & GPT-4 show with few-shot prompting indicate that large LLMs have a stronger capability of quickly learning from examples for health tasks.
FINE-TUNING PERFORMANCE
Their fully fine-tuned model, Health-Alpaca, shows the best performance in 5 out of 13 healthcare tasks. Through fine-tuning, it achieves comparable or better performance than GPT-4, a model roughly two orders of magnitude (×250) larger.
Health-Alpaca-LoRA, a model fine-tuned with the parameter-efficient technique LoRA, showed a performance boost in most tasks, except the calorie (CAL) prediction task. These results indicate that LLMs can be easily tuned for tasks with multi-modal, time-series wearable data.

CONCLUSION
All in all, adding context can significantly improve the models' performance. Among the three types of context information, health context delivers the biggest boost, though its effectiveness varies across LLMs and datasets.

Palmyra-Med benefited the most from the enhancement, showing up to a 44.58% improvement when temporal context is added, since this context emphasizes the temporal aspects of the time-series data.
However, temporal context does not help improve the performance of large LLMs like GPT-3.5 or GPT-4, probably because they already possess the capability to understand the statistical summary of the time-series data.
Also, while dataset-specific fine-tuning often failed on prediction tasks from other datasets, the multi-dataset fine-tuned model, Health-Alpaca, exhibited reasonably generalizable performance across tasks.
In a few cases, such as AW_FB → STRS and LifeSnaps → ANX, the cross-dataset fine-tuned models even surpassed the zero-shot and dataset-specific fine-tuning approaches. These findings suggest that fine-tuning on a single dataset can impart health knowledge to a certain extent and thereby improve overall generalization.
However, such improvement is not consistently observed across all tasks, and it depends on the overlapping content across datasets.
LIMITATIONS
- The absence of annotations from professional healthcare experts and the lack of debiasing may introduce diagnostic biases and inaccuracies into the models.
- The focus was on predictive performance, neglecting a dedicated evaluation of the models’ reasoning capabilities, a key aspect.
- Lack of explicit privacy-preserving methods highlights the need for future research to address data security and ethical concerns.
I recommend going through the paper for the numerical results of the model evaluations and a more thorough intuition.
Let’s connect and build a project together: 🐈⬛
