(P26) Optimization of transcriptomics data analysis for accurate identification of immune signatures, clinical relevance and biomarker discovery in infectious diseases


Anoop T. Ambikan [1], Flora Mikaeloff [1], Anders Sönnerborg [1], Ujjwal Neogi [1]


[1] Karolinska Institutet, ANA Futura, Department of Laboratory Medicine, Stockholm


Background: Elite controllers (EC) are individuals infected with human immunodeficiency virus type 1 (HIV-1) who can control the virus and slow progression to immunodeficiency without antiretroviral therapy (ART). It has been hypothesized that EC hold the key to achieving a functional cure for HIV infection. Under strict definitions, fewer than 0.5% of the HIV-infected population are identified as EC. Over the years, bulk RNA sequencing (RNAseq) has been used to identify signatures of a disease condition relative to control groups, but selecting an appropriate analysis pipeline remains a challenge. Various tools are available for RNAseq analysis, but many of them are designed for specific experimental designs and are mainly used in non-communicable diseases. Choosing the wrong tool can negatively affect the results and their biological interpretation. Optimizing the analysis pipeline for the data and experimental design of a given study is therefore important for producing robust results and drawing accurate clinical inferences.

Methods: The study cohort consisted of EC and viremic progressors (VP; HIV-infected individuals with the virus present in the body who are progressing towards immune deficiency). There were 19 EC samples (10 males and 9 females) and 17 VP samples (11 males and 6 females). The data were analyzed with several pipelines that used different tools at each of three steps: alignment, read count estimation, and differential expression analysis. For alignment, the widely used spliced short-read aligner STAR and the transcript-level pseudo-aligners Kallisto and Salmon were used. For read count estimation, featureCounts from the Subread package and HTSeq were selected. The count-based tools DESeq2 and edgeR were chosen for differential gene expression analysis. First, the results were compared with previously published results. The results from each pipeline were then compared step by step in terms of correlation. Tools that showed low concordance with the others were excluded, and consensus results from the more concordant tools were taken forward for downstream analysis.
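
The consensus step described in the Methods can be illustrated with a minimal sketch. The abstract does not state which concordance measure or cut-off was used, so the pairwise Jaccard index, the `min_concordance` threshold, the function names and the toy gene identifiers below are all assumptions made for illustration only, not the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_concordance(calls):
    """Mean pairwise Jaccard index of each pipeline against all the others."""
    if len(calls) < 2:
        raise ValueError("need at least two pipelines to compare")
    totals = {name: 0.0 for name in calls}
    for p, q in combinations(calls, 2):
        score = jaccard(calls[p], calls[q])
        totals[p] += score
        totals[q] += score
    n_others = len(calls) - 1
    return {name: total / n_others for name, total in totals.items()}


def consensus(calls, min_concordance):
    """Drop pipelines below the cut-off, then intersect the remaining gene sets."""
    scores = mean_concordance(calls)
    kept = sorted(name for name, s in scores.items() if s >= min_concordance)
    if not kept:
        return kept, set()
    shared = set.intersection(*(calls[name] for name in kept))
    return kept, shared


# Hypothetical differentially expressed gene sets, one per pipeline.
# These identifiers are invented for the example, not taken from the study.
toy_calls = {
    "STAR+featureCounts+DESeq2": {"G1", "G2", "G3", "G4"},
    "STAR+featureCounts+edgeR": {"G1", "G2", "G3", "G5"},
    "STAR+HTSeq+DESeq2": {"G1", "G6", "G7"},
}

kept_pipelines, consensus_genes = consensus(toy_calls, min_concordance=0.3)
print(kept_pipelines)
print(sorted(consensus_genes))
```

In this toy input the third pipeline agrees poorly with the other two and is excluded, and the consensus is the set of genes reported by both remaining pipelines. A correlation of fold changes or of ranked p-values would be an equally plausible concordance measure.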

Results: Different tools gave inconsistent results for the same data. Disagreement was seen at all three steps of the pipeline. Outcomes also conflicted between gene-level and transcript-level expression analysis: some features were significantly regulated in the gene-level analysis but not regulated, or negatively regulated, in the transcript-level analysis. Large and significant differences were found at the read count estimation step between the results from featureCounts and HTSeq. Some features had zero read counts with HTSeq but more than 1000 read counts with featureCounts. Differential expression analysis appeared to be heavily affected by the read count estimation process.
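
The kind of read count discordance reported above (zero reads from one tool, more than 1000 from the other) can be screened for with a short check. This is a sketch under stated assumptions: the count tables are taken to be already loaded as plain dictionaries keyed by feature identifier, and the function names, the pseudocount and the toy counts are hypothetical and were invented for the example rather than drawn from the study data.

```python
import math


def discordant_features(counts_a, counts_b, high=1000):
    """Features with zero reads in one table and more than `high` in the other."""
    flagged = []
    # Only features present in both tables can be compared.
    for feature in sorted(counts_a.keys() & counts_b.keys()):
        a, b = counts_a[feature], counts_b[feature]
        if (a == 0 and b > high) or (b == 0 and a > high):
            flagged.append((feature, a, b))
    return flagged


def log2_ratio(a, b, pseudocount=1):
    """log2 ratio of two counts; the pseudocount avoids division by zero."""
    return math.log2((a + pseudocount) / (b + pseudocount))


# Invented counts for illustration only.
toy_htseq = {"geneA": 0, "geneB": 520, "geneC": 15, "geneD": 0}
toy_featurecounts = {"geneA": 1432, "geneB": 530, "geneC": 15, "geneD": 3}

for feature, htseq_n, fc_n in discordant_features(toy_htseq, toy_featurecounts):
    print(feature, htseq_n, fc_n, round(log2_ratio(fc_n, htseq_n), 2))
```

Features flagged this way are natural candidates for manual inspection, for example to check whether they overlap other annotated features or contain multi-mapping reads, two situations that counting tools may handle differently depending on their settings.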

Conclusion: Our study showed that different tools can produce different outcomes for the same data. It is therefore crucial to select tools appropriate to the data and experimental design of the study. We propose a data-driven analysis plan, rather than a generic best-practice analysis, chosen according to data quality and experimental design, in order to draw accurate clinical inferences, identify biomarkers and, most importantly, support mechanistic studies.