Immunis.AI

Enhancing Data Management and Processing Efficiency in Biotech

Client Industry: Biotech

Overview

Immunis.AI is an innovative immunogenomics company, dedicated to non-invasive disease detection through its Intelligentia platform. This platform combines RNA sequencing, immune system analysis, and machine learning to detect cancer and other diseases at the earliest stages, focusing on biomarker discovery and personalized medicine.

Challenge

The company faced significant challenges in managing and processing large volumes of genomic data, particularly FastQ files from Illumina sequencing machines. The initial ingestion process was manual, time-consuming, and error-prone, taking over a week to complete. In addition, the RNA-seq data processing pipeline was complex and lacked scalability, which delayed research outcomes.

Solution Overview

FastQ Data Ingestion Pipeline:
1. Technologies Used: Apache NiFi, AWS EKS, Terraform, Custom Python Scripts, AWS S3.
2. Implementation:
  - Apache NiFi: Orchestrated the data flow from various sources into AWS S3 for storage. NiFi's data routing and transformation capabilities were pivotal in automating the ingestion process.
  - AWS EKS: Provided Kubernetes orchestration for deploying and managing containerized applications, keeping the ingestion process scalable and highly available.
  - Terraform: Implemented for Infrastructure as Code (IaC), allowing consistent, repeatable deployment of AWS resources.
  - Custom Python Scripts: Developed to handle specific data processing tasks, interfacing with NiFi for custom data manipulation and validation before storage.
  - AWS S3: Chosen over EBS for its cost-effectiveness and scalability, significantly reducing storage costs and improving data access speed.
3. Outcome:
  - Reduced data ingestion time from over a week to approximately 72 hours, a reduction of about 75%.
  - Eliminated manual processes, reducing human error and operational costs.
  - Enabled the company to scale data ingestion according to research needs without performance degradation.
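
The validation step handled by the custom Python scripts can be illustrated with a minimal sketch. It is not the client's actual script: the function names (open_fastq, read_records, validate_fastq) and the specific checks are assumptions chosen for illustration. It reads a plain or gzip-compressed FastQ file, confirms that each four-line record has an "@" header, a "+" separator, and a quality string as long as its sequence, and returns the record count.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO, Tuple


def open_fastq(path: Path) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_records(handle: TextIO) -> Iterator[Tuple[str, ...]]:
    """Yield (header, sequence, separator, quality) for each 4-line record."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not any(lines):  # nothing left to read: end of file
            return
        yield tuple(lines)


def validate_fastq(path: Path) -> int:
    """Return the number of records; raise ValueError on a malformed record."""
    count = 0
    with open_fastq(path) as handle:
        for header, sequence, separator, quality in read_records(handle):
            record_number = count + 1
            if not header.startswith("@"):
                raise ValueError(f"{path}: record {record_number}: header must start with '@'")
            if not separator.startswith("+"):
                raise ValueError(f"{path}: record {record_number}: separator must start with '+'")
            if len(sequence) != len(quality):
                raise ValueError(f"{path}: record {record_number}: sequence and quality lengths differ")
            count += 1
    return count
```

A script along these lines could run before a file is handed to the upload step, so that truncated or corrupted files are rejected instead of being written to S3. How such a script would be wired into NiFi is not described in this case study.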

RNA-seq Data Processing Pipeline:
1. Technologies Used: AWS ECS, EC2, AWS Batch, Nextflow, SeqPurge, STAR, featureCounts.
2. Implementation:
  - AWS ECS & EC2: Provided container management and compute capacity for processing large datasets. ECS was used for orchestrating tasks in containers, while EC2 instances offered the necessary computational power.
  - AWS Batch: Employed for batch processing, optimizing job scheduling and resource allocation for RNA-seq tasks.
  - Nextflow: Utilized as the workflow manager to automate the pipeline steps, ensuring reproducibility and scalability of the RNA-seq analysis.
  - SeqPurge: Used for adapter trimming, ensuring data quality before further processing.
  - STAR: Used for mapping reads to the reference genome, providing fast and accurate alignment.
  - featureCounts: Applied for counting reads at the gene level, quantifying gene expression from the aligned data.
3. Outcome:
  - Streamlined the conversion of vast .fastq files into a manageable gene count matrix, speeding up the research cycle.
  - Automation led to consistent, high-quality results with reduced turnaround times.
  - The solution was scalable, adapting to increased data volumes without compromising on speed or accuracy.
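
The final step, turning per-sample counts into a single gene count matrix, can be sketched as follows. This is an illustrative sketch rather than the pipeline's actual code: it assumes featureCounts was run once per sample, producing its standard tab-delimited output (a leading "#" comment line, a header row, six annotation columns, then the count column), and the names read_counts and build_matrix are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List

# featureCounts writes Geneid, Chr, Start, End, Strand, Length before the counts.
ANNOTATION_COLUMNS = 6


def read_counts(path: Path) -> Dict[str, int]:
    """Read one single-sample featureCounts table into {gene_id: count}."""
    counts: Dict[str, int] = {}
    with open(path, newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        rows = csv.reader(data_lines, delimiter="\t")
        next(rows)  # skip the column header row
        for row in rows:
            counts[row[0]] = int(row[ANNOTATION_COLUMNS])
    return counts


def build_matrix(sample_files: Dict[str, Path]) -> List[List[str]]:
    """Merge per-sample tables into rows of [gene_id, count_sample1, ...]."""
    per_sample = {name: read_counts(path) for name, path in sample_files.items()}
    genes = sorted(set().union(*per_sample.values()))
    matrix = [["gene_id", *per_sample.keys()]]
    for gene in genes:
        # A gene absent from a sample's table is reported as zero reads.
        matrix.append([gene, *(str(per_sample[name].get(gene, 0)) for name in per_sample)])
    return matrix
```

The resulting rows can be written out with csv.writer to give one line per gene and one column per sample, which is the shape downstream expression analysis typically expects.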

Business Impact

Time Efficiency: Both solutions dramatically reduced the time required for data ingestion and processing, allowing for quicker research insights and faster decision-making.
Cost Reduction: Transitioning to AWS S3 and leveraging cloud-native services like ECS and Batch decreased operational costs while maintaining or enhancing performance.
Scalability: The biotech company can now handle increased data loads without overhauling its infrastructure, supporting growth and new research initiatives.
Error Reduction: Automation and error-checking mechanisms in the data pipelines minimized manual errors, improving data integrity and reliability.

Conclusions

By implementing these innovative solutions, Dasnuve not only addressed immediate data management challenges but also laid a foundation for future scalability and efficiency in the client's research processes. This case study exemplifies how targeted technological interventions can transform operational capabilities in the biotech sector, leading to enhanced scientific discovery and competitive advantage.