Márton Münz, PhD

Computational Biology | Cloud Infrastructure | AI Consulting

AI-powered data workflows in Biotech

A promising frontier in biotech research is the use of Large Language Models (LLMs) and LLM-based agents that directly interact with internal infrastructure — i.e. have the ability to query databases, trigger computational workflows, and dynamically interact with in-house data pipelines. These AI tools will transform how researchers access and manage complex bioinformatics systems, enabling natural language commands to replace manual scripting or GUIs, boosting everyday productivity. In addition, the ability to use natural language to drive complex operations can reduce the overhead of managing fragmented data and compute ecosystems.

More crucially, LLM-based workflows and agents take automation to the next level. Combining reasoning abilities and tool use with access to both internal (structured or unstructured) data and external APIs, they can decompose complex tasks into manageable steps and carry them out autonomously. This opens the door to fully autonomous research assistants that can, for instance, monitor new data arrivals, preprocess raw datasets, identify and execute relevant analysis pipelines, interpret the results, and even generate preliminary reports or visualizations — all without human intervention.

LLM-driven automation isn't just about speeding up existing tasks; it's about redefining how research is conducted.

Complete autonomy of these AI agents is likely not ideal in domains like biotech and bioinformatics, so building human-in-the-loop systems is essential. By embedding human oversight into the system, it is possible to combine the speed and scalability of LLM-based agents with the domain expertise and critical judgment of scientists.
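As a rough illustration, the snippet below gates an agent's riskier tool calls behind explicit human approval; the tool names and the `run_tool` dispatcher are hypothetical stand-ins for whatever execution layer the agent framework provides:

```python
# Hypothetical set of actions that must never run without human sign-off.
RISKY_TOOLS = {"modify_iam_policy", "delete_dataset", "launch_instance"}

def run_tool(tool: str, args: dict) -> str:
    # Stub: a real system would dispatch to the actual tool implementation.
    return f"executed {tool} with {args}"

def execute_with_oversight(action: dict) -> str:
    """Run low-risk actions directly; ask a human before risky ones."""
    if action["tool"] in RISKY_TOOLS:
        print(f"Agent proposes: {action['tool']}({action['args']})")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return "action rejected by reviewer"
    return run_tool(action["tool"], action["args"])
```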

Examples of integrating LLMs into Biotech workflows

1. RAG systems with access to internal SOPs / Guidelines / Protocols: When a lab technician asks, “How do I run a maintenance wash on the Illumina MiSeq before a 600-cycle v3 kit?”, the LLM agent would respond with the correct procedure as described in the relevant in-house SOP document.
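A minimal sketch of the retrieval step behind such an answer, assuming the SOP documents have already been split into chunks (the embedding model and the toy chunks are examples, not a prescribed stack):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus standing in for chunked in-house SOP documents.
sop_chunks = [
    "MiSeq maintenance wash: load the wash tray with fresh wash solution ...",
    "Loading a 600-cycle v3 reagent cartridge: thaw at room temperature ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
chunk_vecs = model.encode(sop_chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k SOP chunks most similar to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity (vectors are normalized)
    return [sop_chunks[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("How do I run a maintenance wash on the MiSeq?"))
prompt = f"Answer using only these SOP excerpts:\n{context}\n\nQuestion: ..."
# `prompt` is then passed to whichever LLM the system is built on.
```

In production the chunk embeddings would live in a vector database rather than in memory, but the retrieval logic stays the same.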

2. Querying internal omics (meta)data using natural language: Instead of learning SQL or Cypher, a researcher could just ask, “Show me all in-house RNA-seq experiments from liver tissues in mouse models of NASH”. The LLM agent would translate it into the appropriate query and return the results.
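One possible shape of that translation layer, sketched against a hypothetical `experiments` table, with a simple guardrail that refuses anything but read-only SELECT statements:

```python
import sqlite3
from openai import OpenAI  # any chat-capable LLM client would do

SCHEMA = "experiments(id, assay_type, tissue, organism, disease_model)"  # hypothetical

client = OpenAI()

def nl_to_sql(question: str) -> str:
    """Ask the LLM to translate a question into a read-only SQL query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[
            {"role": "system",
             "content": "Translate the user's question into a single SQLite "
                        f"SELECT statement over this schema:\n{SCHEMA}\n"
                        "Return only the SQL, no explanations."},
            {"role": "user", "content": question},
        ],
    )
    sql = resp.choices[0].message.content.strip()
    if not sql.lower().startswith("select"):  # crude read-only guardrail
        raise ValueError(f"Refusing to run non-SELECT statement: {sql}")
    return sql

con = sqlite3.connect("file:metadata.db?mode=ro", uri=True)  # read-only handle
rows = con.execute(nl_to_sql(
    "Show me all in-house RNA-seq experiments from liver tissues "
    "in mouse models of NASH")).fetchall()
```

Opening the database read-only and whitelisting SELECT are deliberately redundant layers: generated SQL should never be trusted to be safe on its own.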

3. Research questions combining internal and external data: Asked “Can you find publications about colorectal cancer that discuss genes found to be differentially expressed in our in-house assays?”, the LLM agent would return relevant results from its knowledge graph, which integrates in-house datasets with public data fetched via APIs.
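For instance, if the internal data lived in a Neo4j knowledge graph, the agent's two tool calls might look roughly like this (the graph schema, credentials, and the gene cap are illustrative; the literature search uses NCBI's public E-utilities endpoint):

```python
import requests
from neo4j import GraphDatabase

# Hypothetical graph schema: (:Gene)-[:DIFFERENTIALLY_EXPRESSED_IN]->(:Assay)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    genes = [r["g.symbol"] for r in session.run(
        "MATCH (g:Gene)-[:DIFFERENTIALLY_EXPRESSED_IN]->(:Assay) "
        "RETURN DISTINCT g.symbol")]

# Search PubMed for colorectal cancer papers mentioning those genes.
term = "colorectal cancer AND (" + " OR ".join(genes[:20]) + ")"
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": term, "retmode": "json", "retmax": 20},
)
pmids = resp.json()["esearchresult"]["idlist"]
```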

4. Data management: "Give our bioinformatician, John Doe, access to the RNA-seq dataset RNASeq_ASSAY_1, excluding Samples 1 to 6." The LLM agent would modify the underlying AWS IAM policy to give the user permission to access an S3 bucket folder, excluding the specified subfolders. Data access control is therefore as simple as issuing a natural language instruction, shielding users from the complexity of AWS policy syntax and reducing the risk of misconfiguration.
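Under the hood, the agent could attach an inline IAM policy like the one below (bucket name, prefix layout, and user name are hypothetical; a complete policy would also need a matching `s3:ListBucket` statement):

```python
import json
import boto3

iam = boto3.client("iam")

dataset = "arn:aws:s3:::omics-data/RNASeq_ASSAY_1"  # hypothetical layout
excluded = [f"{dataset}/Sample_{i}/*" for i in range(1, 7)]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": f"{dataset}/*"},
        {"Effect": "Deny",  # an explicit Deny always overrides an Allow
         "Action": ["s3:GetObject"],
         "Resource": excluded},
    ],
}

iam.put_user_policy(
    UserName="john.doe",
    PolicyName="RNASeq-ASSAY-1-read-access",
    PolicyDocument=json.dumps(policy),
)
```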

5. Provisioning/configuring cloud resources: "Launch a t2.large EC2 instance with a new EFS file system mounted. Switch the EFS to Elastic Throughput mode." The agent modifies the AWS infrastructure as requested.
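In boto3 terms, the agent's actions might reduce to calls like these (the AMI ID is a placeholder; a real run would wait for the file system to become available before updating it, and would still need a mount target and an NFS client to complete the mount):

```python
import boto3

ec2 = boto3.client("ec2")
efs = boto3.client("efs")

# Create the EFS file system, then switch it to Elastic Throughput mode.
fs = efs.create_file_system(CreationToken="analysis-scratch")
# In practice: poll efs.describe_file_systems until the state is "available".
efs.update_file_system(FileSystemId=fs["FileSystemId"],
                       ThroughputMode="elastic")

# Launch the requested instance (ImageId is a placeholder).
ec2.run_instances(ImageId="ami-0abcdef1234567890",
                  InstanceType="t2.large",
                  MinCount=1, MaxCount=1)

# Completing the mount also requires efs.create_mount_target(...) in the
# instance's subnet and mounting the file system from the host.
```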

6. Calling bioinformatics pipelines and interpreting results: An LLM agent that has access to an RNA-seq Differential Gene Expression analysis pipeline and the Open Targets API would be able to answer the question: "Identify potential drug targets for Parkinson's disease by cross-referencing overexpressed genes (p-adj < 0.05 and logFC ≥ 1) from our in-house datasets with known gene-disease associations and druggability information from the Open Targets database." Note that this differs from example 3, where the agent has the list of differentially expressed genes readily available in its knowledge base; here, the agent itself executes the bioinformatics pipeline to derive these genes.
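Once the pipeline has produced the filtered gene list, the cross-referencing step could query the Open Targets GraphQL API along these lines (field names follow the Open Targets Platform schema at the time of writing; the single example gene is SNCA):

```python
import requests

OT_URL = "https://api.platform.opentargets.org/api/v4/graphql"

QUERY = """
query targetInfo($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    tractability { modality label value }
    associatedDiseases { rows { disease { id name } score } }
  }
}"""

# In the full workflow this list comes from the DGE pipeline run
# (p-adj < 0.05, logFC >= 1); one gene is shown here as an example.
de_genes = ["ENSG00000145335"]  # SNCA

for ensembl_id in de_genes:
    target = requests.post(
        OT_URL, json={"query": QUERY, "variables": {"ensemblId": ensembl_id}}
    ).json()["data"]["target"]
    hits = [row for row in target["associatedDiseases"]["rows"]
            if "Parkinson" in row["disease"]["name"]]
    if hits:
        print(target["approvedSymbol"], hits[0]["score"])
```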

7. Controlling multi-step workflows: “Please rerun the QC analysis of the sequencing run 240508_NB552789_0123_AH5C7KBGX3 after switching the adapter trimming strategy from a fixed 10 bp trim to dynamic trimming. If the mapping rate is normal, carry on with variant calling, but send me a Slack alert if not. Email me a full QC report at the end.” The agent controls the flow of a data analysis pipeline implemented in AWS Step Functions.
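The agent's side of this can be as small as starting the right execution with the right parameters; the branching on mapping rate and the Slack/email notifications live in the state machine itself (the ARN and input schema below are hypothetical):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

sfn.start_execution(
    stateMachineArn=("arn:aws:states:eu-west-1:123456789012:"
                     "stateMachine:sequencing-qc"),
    input=json.dumps({
        "run_id": "240508_NB552789_0123_AH5C7KBGX3",
        "adapter_trimming": "dynamic",         # was: fixed 10 bp trim
        "on_low_mapping_rate": "slack_alert",  # else: continue to variant calling
        "final_report": "email",
    }),
)
```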

8. Complex research queries: "Using publicly available gene expression and clinical outcome data and relevant publications not older than 5 years, can you identify novel biomarkers for drug resistance in triple-negative breast cancer, and propose a mechanism by which they mediate resistance?" To generate an answer to this request, an LLM agent will need to check multiple data sources (e.g. GEO, TCGA, cBioPortal, PubMed), reason across diverse types of information, and synthesize its findings into a coherent hypothesis.

Key implementation challenges

Although adding LLMs and LLM-based agents to biotech workflows holds great potential, their implementation presents some challenges:

- Data Security: Using LLM API endpoints like those of OpenAI or Anthropic (Claude) means that all prompts and queries — including proprietary data from a biotech company — are sent to and processed on external third-party servers, raising concerns about data confidentiality and compliance. While OpenAI, for example, claims not to use these data for training its models, some companies remain wary, especially in highly regulated environments such as pharmaceuticals or genomics, where any data exposure could lead to serious legal or competitive consequences. One solution is to avoid sending sensitive data to third-party APIs altogether by deploying open-source LLMs locally or in the cloud, with inference running on on-premises or VPC-isolated cloud servers. LLM inference, however, demands powerful GPUs to run these models with low latency.
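As an illustration of the self-hosted route, an open-weights model can be run with an inference engine such as vLLM on a GPU server inside your own network, so no prompt ever leaves it (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Runs entirely on the local GPU server; nothing is sent to external APIs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example open-weights model
params = SamplingParams(temperature=0.2, max_tokens=512)

prompt = "Summarize the key findings in the attached variant annotation table: ..."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

vLLM can also expose the same model behind an OpenAI-compatible HTTP endpoint, so agent code written against the OpenAI client can be pointed at the private deployment with minimal changes.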

- Customization: General-purpose LLM chatbots like ChatGPT, especially when equipped with web search capabilities, are powerful tools for research. However, they lack access to internal data and infrastructure. Customizing LLM-based systems requires integrating them with internal data sources, defining the tools they can use to interact with in-house infrastructure, outlining their operational logic and workflows, and specifying their level of autonomy and how they engage with users. Systems can differ significantly in these implementation details, depending on the specific use cases they are optimized for.

What I offer

I design and develop custom solutions for integrating LLM-based systems into your biotech workflows, tailored to your specific data, infrastructure, and use cases — from straightforward RAG implementations to complex agentic workflows. This includes deploying private LLM models, either on-premises or in secure, isolated cloud environments, to ensure full data confidentiality and compliance. We will work closely to define a custom system that fits your use case and aligns with your research goals and data security requirements.

