Speeding Up Scientific Literature Reviews with NVIDIA NIM Microservices for LLMs
March 4, 2025

A well-crafted systematic review is often the initial step for researchers exploring a scientific field. For scientists new to the field, it provides a structured overview of the domain. For experts, it refines their understanding and sparks new ideas. In 2024 alone, 218,650 review articles were indexed in the Web of Science database, highlighting the importance of these resources in research.
Completing a systematic review significantly enhances a researcher’s knowledge base and their academic impact. However, traditional review writing requires collecting, reading, and summarizing large volumes of academic articles on a specific topic. Due to the time-consuming nature of this manual exercise, the scope of processed literature is often confined to dozens or a few hundred articles. Interdisciplinary content—frequently outside the researcher’s area of expertise—adds another layer of complexity.
These challenges make it increasingly difficult to create comprehensive, reliable, and impactful systematic reviews.
The advent of large language models (LLMs) offers a groundbreaking solution, enabling the rapid extraction and synthesis of information from extensive literature. Participating in Generative AI Codefest Australia gave us a unique opportunity to explore this idea: with support from NVIDIA AI experts, we used NVIDIA NIM microservices to accelerate our literature review, rapidly testing and fine-tuning several state-of-the-art LLMs for our literature analysis process.
Testing the potential of LLMs for processing papers
As a research group specializing in physiological ecology within the ARC Special Research Initiative Securing Antarctica’s Environmental Future (SAEF), we embarked on writing a review of the literature on the global responses of non-vascular plants, such as moss or lichen, to wind.
However, we quickly faced a challenge: many relevant articles on wind-plant interactions failed to explicitly mention these key words in their titles or abstracts, which are typically used as primary filters during literature screening. A comprehensive analysis of the topic required manually reading the full text of each article—a highly time-consuming process.
We decided to explore the potential of using LLMs to extract content specifically related to wind-plant interactions from the articles. To achieve this, we implemented a simple Q&A application based on the Llama 3.1 8B Instruct NIM microservice (Figure 1), which enabled us to build an initial prototype quickly.
This first prototype processed the papers sequentially and proved extremely useful for crafting and optimizing the prompts used to extract key information from each article.
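Below is a minimal sketch of what such a sequential Q&A prototype can look like. It assumes a locally deployed Llama 3.1 8B Instruct NIM exposing its OpenAI-compatible API at http://localhost:8000/v1; the question text, folder layout, and generation parameters are illustrative rather than the exact ones we used.

```python
# Minimal sequential Q&A prototype against a locally deployed NIM.
# Endpoint, model ID, question, and folder layout are illustrative.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

QUESTION = (
    "Does this article report any effect of wind on non-vascular plants "
    "(mosses or lichens)? If so, summarize the findings in two or three sentences."
)

def ask(article_text: str) -> str:
    """Send one article plus the research question to the local NIM."""
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "You are a careful scientific assistant."},
            {"role": "user", "content": f"{QUESTION}\n\nArticle:\n{article_text}"},
        ],
        temperature=0.2,
        max_tokens=512,
    )
    return response.choices[0].message.content

# Process one paper at a time (the sequential baseline we later parallelized)
for path in sorted(Path("papers_txt").glob("*.txt")):
    print(f"{path.name}: {ask(path.read_text(encoding='utf-8'))}\n")
```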

To validate the accuracy of the extracted information, we initially checked the results manually. When no significant errors were found in the test dataset, we identified opportunities to further enhance the efficiency of key information extraction using LLMs (Figure 2). These included converting the papers from PDF to structured JSON; extracting images, tables, and charts; and using parallel processing to speed up the processing of papers.
Enhancing the performance of LLMs for more efficient information extraction
Using NVIDIA NIM microservices for LLMs and NVIDIA nv-ingest, we deployed the models and a data ingestion pipeline in our local environment on eight NVIDIA A100 80-GB GPUs. We also fine-tuned the models using low-rank adaptation (LoRA) to improve the accuracy of information extraction from the papers.
We compiled a dataset of over 2K scientific articles related to the targeted research domain, sourced from the Web of Science and Scopus databases. Over a week during Generative AI Codefest, we focused on experimenting with various strategies to optimize the efficiency and accuracy of key information extraction from these articles.
Best-performing model
To determine the best-performing model, we tested a range of instruction-based and general-purpose LLMs from the NVIDIA API Catalog on a set of randomly selected articles. Each model was assessed for its accuracy and comprehensiveness in information extraction.
Ultimately, we determined that Llama 3.1 8B Instruct was the most suitable for our needs.
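As an illustration, this kind of screening can be scripted against the NVIDIA API Catalog, which exposes an OpenAI-compatible endpoint. The sketch below assumes an NVIDIA_API_KEY environment variable; the candidate model IDs, question, and test articles are placeholders for our manual accuracy and comprehensiveness assessment.

```python
# Sketch of screening candidate models from the NVIDIA API Catalog on a small
# evaluation set. Model IDs and the test question are examples; the outputs
# were compared manually against reference answers.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

CANDIDATES = [
    "meta/llama-3.1-8b-instruct",
    "meta/llama-3.1-70b-instruct",
    "mistralai/mixtral-8x7b-instruct-v0.1",
]

QUESTION = "Does the study report wind effects on mosses or lichens?"
test_articles = ["<full text of a randomly selected article>"]  # hypothetical test set

for model in CANDIDATES:
    for article in test_articles:
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"{QUESTION}\n\n{article}"}],
            temperature=0.2,
            max_tokens=512,
        ).choices[0].message.content
        print(f"--- {model} ---\n{answer}\n")
```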
Processing speed
We developed a Q&A module using Streamlit to answer user-defined, research-specific questions.
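A stripped-down version of such a module might look like the sketch below; the endpoint, model ID, JSON schema, and folder layout are assumptions for illustration.

```python
# Minimal Streamlit Q&A sketch: the user picks a pre-processed paper and types
# a research question; the app queries the locally deployed NIM.
# Endpoint, model ID, and the JSON schema of the papers are illustrative.
import json
from pathlib import Path

import streamlit as st
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

st.title("Literature Q&A")
question = st.text_input("Research question")
paper = st.selectbox("Paper", sorted(p.name for p in Path("papers_json").glob("*.json")))

if st.button("Ask") and question:
    doc = json.loads((Path("papers_json") / paper).read_text(encoding="utf-8"))
    # Hypothetical schema: a list of extracted elements with a "text" field
    context = "\n\n".join(el["text"] for el in doc.get("elements", []))
    answer = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": f"{question}\n\nContext:\n{context}"}],
        temperature=0.2,
        max_tokens=512,
    ).choices[0].message.content
    st.write(answer)
```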
To further improve processing speed, we implemented parallel processing of the prompts sent to the LLM engine and used KV caching, which accelerated the computation by a factor of 6x when using 16 threads.
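The client-side parallelism is straightforward: a thread pool keeps multiple requests in flight so the serving engine can batch them, while KV caching is handled inside the inference engine itself. A minimal sketch, reusing the ask() helper from the earlier prototype (thread count and folder layout are illustrative):

```python
# Client-side parallelism: send prompts for many papers concurrently; the NIM
# serving engine batches the requests and manages its KV cache internally.
# Reuses the ask() helper from the sequential prototype above.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

paths = sorted(Path("papers_txt").glob("*.txt"))

def process(path: Path) -> tuple[str, str]:
    return path.name, ask(path.read_text(encoding="utf-8"))

with ThreadPoolExecutor(max_workers=16) as pool:
    for name, answer in pool.map(process, paths):
        print(f"{name}: {answer[:200]}")
```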
Extraction content types
We used nv-ingest to extract content from the original PDFs, including text, figures, tables, and charts, into structured JSON files. This made it possible to extract information beyond the text alone, providing a more comprehensive context for answering the questions.
Using JSON files instead of the original PDF files during inference also reduced the processing time significantly, by an additional factor of 4.25x.
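The exact JSON layout depends on the nv-ingest configuration, but at inference time the idea is simply to assemble the extracted text, table, and chart elements into one context string instead of re-parsing the PDF. A hedged sketch with an illustrative schema:

```python
# Build a prompt context from an nv-ingest JSON output instead of the PDF.
# The field names ("document_type", "metadata", "content") are illustrative;
# adapt them to the schema your ingestion pipeline actually produces.
import json
from pathlib import Path

def build_context(json_path: Path, max_chars: int = 20_000) -> str:
    elements = json.loads(json_path.read_text(encoding="utf-8"))
    parts = []
    for element in elements:
        kind = element.get("document_type", "text")  # text, table, chart, image
        content = element.get("metadata", {}).get("content", "")
        if content:
            parts.append(f"[{kind}] {content}")
    return "\n\n".join(parts)[:max_chars]

context = build_context(Path("papers_json/example_article.json"))
```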
Results
Thanks to these improvements, we significantly reduced the time required to extract information from our database of papers, with a total speedup of 25.25x compared to our initial implementation.
Processing the entirety of our database now takes less than 30 minutes using two A100 80-GB GPUs and 16 threads.
Compared to the traditional approach of manually reading and analyzing an entire article, which typically takes about one hour, this optimized workflow achieved a time savings of over 99% (Figure 3).

In addition to information extraction, we also investigated automated article classification. By fine-tuning Llama 3.1 8B Instruct with a LoRA adapter on a manually annotated sample of papers, we successfully automated the classification process, demonstrating its effectiveness in organizing complex datasets of scientific papers.
The results indicated that each article required only 2 seconds for classification, compared to the 300+ seconds required on average for a manual classification by an experienced reader (Figure 3).
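For reference, a LoRA fine-tune of this kind can be set up with the Hugging Face transformers and peft libraries, as in the sketch below; the dataset format, target modules, and hyperparameters are illustrative rather than the exact configuration we used, and the resulting adapter can then be served alongside the base model.

```python
# Sketch of a LoRA fine-tune for article classification with transformers + peft.
# The training examples (prompt + expected category rendered as one string) and
# all hyperparameters are illustrative.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"  # Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Manually annotated examples: abstract plus target category, rendered as text
examples = [
    {"text": "Classify this abstract into one category ... ### Category: wind-dispersal"},
    # ... more annotated papers
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-classifier",
                           per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4, fp16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-classifier/adapter")
```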
Future directions
We’re currently refining our workflow to further accelerate processing. We’re also improving our user interface to provide easy access to more locally deployed LLMs and make the tool more accessible to other researchers (Figure 4).
We plan to implement the NVIDIA AI Blueprint for multimodal PDF data extraction to identify the most relevant articles for each research question and interact with those papers.
Beyond technical improvements, we aim to organize the extracted key information for each question and generate visualizations (such as maps showing the locations of the experiments mentioned in the papers) to further accelerate the writing of the systematic review.

Summary
Our work at the Generative AI Codefest demonstrated the transformative potential of AI in accelerating systematic literature reviews. With NVIDIA NIM, we quickly moved from an idea to a working solution that significantly improves the process of information extraction from scientific papers.
This experience highlights how AI can streamline research workflows, enabling faster and more comprehensive insights. LLMs have the potential to facilitate interdisciplinary research, empowering scientists to explore complex, multi-domain research fields more effectively.
Moving forward, we aim to refine these methods and tools, ensuring that they are accessible and scalable for future research across diverse topics.