Back to Projects

Clinical Trials RAG Pipeline

A production-ready ETL pipeline with RAG capabilities for intelligent clinical trials data processing and querying

30.93s
Pipeline Runtime
<25ms
Vector Search
100%
Data Quality
1,500+
Records Processed

Abstract

This project presents a comprehensive ETL pipeline integrated with Retrieval-Augmented Generation (RAG) capabilities for processing and querying clinical trials data. The system ingests real-time data from ClinicalTrials.gov API, processes it using PySpark and Delta Lake, implements rigorous data quality controls with Great Expectations, and provides intelligent querying through OpenAI embeddings, FAISS vector search, and GPT-4 integration. The pipeline demonstrates enterprise-grade architecture with comprehensive audit trails, multilingual support, and sub-second query performance.

System Architecture

Data Flow Pipeline

1
Data Ingestion
ClinicalTrials.gov REST API
2
Processing
PySpark + Delta Lake
3
Validation
Great Expectations QC
4
Vectorization
OpenAI Embeddings
5
Indexing
FAISS Vector Search
6
Querying
GPT-4 RAG Interface

Technical Stack

Data Processing
PySpark 3.5, Delta Lake 4.0
Data Quality
Great Expectations, Pandas
Vector Search
FAISS, OpenAI Embeddings
LLM Integration
GPT-4, LangChain
Storage
Delta Lake, SQLite
Infrastructure
Java 17, Python 3.9+

Technical Implementation

Phase 1: Data Ingestion

Real-time data ingestion from ClinicalTrials.gov REST API with proper error handling and rate limiting.

Key Features:

  • Updated API v2 parameter structure (pageSize, filter.overallStatus)
  • Phase II/III clinical trials filtering
  • Comprehensive error handling and retry logic
  • 100 trials fetched in 0.25 seconds average

Phase 2: Data Processing

Enterprise-grade data transformation using PySpark 3.5 and Delta Lake 4.0 for ACID transactions and schema evolution.

Technical Achievements:

  • Partitioning by year(last_updated) for query optimization
  • Schema evolution with proper column types and null handling
  • Data quality standardization (phase names, sponsors)
  • 1,500+ records processed in 14.94 seconds

Phase 3: Vector Search & RAG

Advanced vector search implementation with OpenAI embeddings and FAISS indexing for sub-second query performance.

Performance Metrics:

  • 1,500 vectors × 1,536 dimensions = 2.3M vector elements
  • FAISS Inner Product with L2 normalization
  • Sub-25ms search times achieved
  • 100% citation accuracy with regex extraction

Technical Challenges & Solutions

Java Version Compatibility

Problem: System had Java 13, but PySpark 3.5 requires Java 17+
Solution: Downloaded OpenJDK 17 from Adoptium, configured JAVA_HOME environment variable
Result: PySpark + Delta Lake working flawlessly

API Parameter Deprecation

Problem: Original API parameters (min_rnk, max_rnk) were deprecated
Solution: Researched v2 API documentation, updated to pageSize and filter.overallStatus
Result: 100 trials fetched in 0.25 seconds per run

Great Expectations API Changes

Problem: GE 1.5.3 completely changed API, old methods removed
Solution: Built pandas-based validation with GE-style reporting
Result: 100% validation pass rate with professional HTML reports

Results & Performance

Performance Benchmarks

Pipeline Runtime
30.93s
Target: <90s
Vector Search
15-25ms
Target: <50ms
Data Quality
100%
Target: >95%
Query Response
2-4s
Target: <5s

Enterprise Features

Scalable: Designed for millions of records
Fast: Sub-second search, 30s full pipeline
Reliable: ACID transactions, error handling
Auditable: Complete query tracking
Multilingual: Korean/English support
Maintainable: Clean code, logging

Conclusion & Impact

This project demonstrates the successful implementation of a production-ready ETL pipeline integrated with cutting-edge RAG capabilities. The system showcases enterprise-grade data engineering skills, modern AI/ML integration, and professional software development practices.

5 Hours
Development Time
1,500+
Records Processed
100%
Quality Score

Technologies Used

PySparkDelta LakeOpenAIFAISSGreat ExpectationsRAGETLPythonGPT-4Vector SearchJava 17SQLitePandasLangChainREST API