Bioinformatics

2019

On the Upper Bounds of the Real-Valued Predictions

Benevenuta S, Fariselli P            

Bioinform Biol Insights. 2019; 13: 1177932219871263. Published online 2019 Aug 23. doi: 10.1177/1177932219871263

Abstract

Predictions are fundamental in science as they allow theories to be tested and falsified. Predictions are ubiquitous in bioinformatics and also help when no first principles are available. Predictions can be divided into classification (when a label is associated with a given input) and regression (when a real value is assigned). Different scores are used to assess the performance of regression predictors; the most widely adopted include the mean square error, the Pearson correlation (ρ), and the coefficient of determination (R²). The common conception related to the last 2 indices is that the theoretical upper bound is 1; however, their upper bounds depend both on the experimental uncertainty and the distribution of target variables. A narrow distribution of the target variable may induce a low upper bound. The knowledge of the theoretical upper bounds also has 2 practical implications: (1) comparing different predictors tested on different data sets may lead to wrong ranking and (2) performances higher than the theoretical upper bounds indicate overtraining and improper usage of the learning data sets. Here, we derive the upper bound for the coefficient of determination, showing that it is lower than that of the square of the Pearson correlation. We provide analytical equations for both indices that can be used to evaluate the upper bound of the predictions when the experimental uncertainty and the target distribution are available. Our considerations are general and applicable to all regression predictors.
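
As a rough illustration of why these bounds fall below 1, the sketch below simulates a simple additive-noise model. The model is an assumption made here for illustration and is not taken from the abstract: each measurement is a true value (standard deviation sigma_t) plus independent Gaussian noise (standard deviation sigma_e), and the best possible predictor is assumed to be only as good as an independent repeat of the experiment. Under these assumptions the expected values work out to ρ = σ_t² / (σ_t² + σ_e²) and R² = (σ_t² − σ_e²) / (σ_t² + σ_e²), so a narrower target distribution lowers both, and R² stays below ρ².

```python
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def simulate(sigma_t, sigma_e, n=20_000, seed=0):
    """Score an 'ideal' predictor against noisy measurements.

    Assumption (illustrative only): both the measurement and the ideal
    prediction equal the true value plus independent Gaussian noise.
    """
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, sigma_t) for _ in range(n)]
    observed = [t + rng.gauss(0.0, sigma_e) for t in truth]
    predicted = [t + rng.gauss(0.0, sigma_e) for t in truth]
    return pearson(observed, predicted), r_squared(observed, predicted)


def expected_bounds(sigma_t, sigma_e):
    """Expected rho and R2 under the additive-noise assumption above."""
    total = sigma_t ** 2 + sigma_e ** 2
    return sigma_t ** 2 / total, (sigma_t ** 2 - sigma_e ** 2) / total


# Narrowing the target distribution (smaller sigma_t) lowers both scores.
for sigma_t in (2.0, 1.0, 0.5):
    rho, r2 = simulate(sigma_t, sigma_e=0.5)
    rho_exp, r2_exp = expected_bounds(sigma_t, 0.5)
    print(f"sigma_t={sigma_t}: rho={rho:.3f} (expected {rho_exp:.3f}), "
          f"R2={r2:.3f} (expected {r2_exp:.3f})")
```

The analytical equations in the paper itself should be preferred when the experimental uncertainty and target distribution of a real data set are known; the simulation only makes the qualitative effect visible.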

Keywords: Upper bound, free energy, machine learning, regression, prediction

 

2021

SeqFu: A suite of utilities for the robust and reproducible manipulation of sequence files

Telatin A, Fariselli P, Birolo G

Bioengineering (Basel). 2021 May 7;8(5):59. doi: 10.3390/bioengineering8050059.

Abstract

Sequence file formats (FASTA and FASTQ) are commonly used in bioinformatics, molecular biology and biochemistry. With the advent of next-generation sequencing (NGS) technologies, the number of FASTQ datasets produced and analyzed has grown exponentially, urging the development of dedicated software to handle, parse, and manipulate such files efficiently. Several bioinformatics packages are available to filter and manipulate FASTA and FASTQ files, yet some essential tasks remain poorly supported, leaving gaps that any workflow analysis of NGS datasets must fill with custom scripts. This can introduce harmful variability and performance bottlenecks in pivotal steps. Here we present a suite of tools, called SeqFu (Sequence Fastx utilities), that provides a broad range of commands to perform both common and specialist operations with ease and is designed to be easily implemented in high-performance analytical pipelines. SeqFu includes high-performance implementations of algorithms to interleave and deinterleave FASTQ files, merge Illumina lanes, and perform various quality controls (identification of degenerate primers, analysis of length statistics, extraction of portions of the datasets). SeqFu dereplicates sequences from multiple files, keeping track of their provenance. SeqFu is developed in Nim for high-performance processing, is freely available, and can be installed with the popular package manager Miniconda.
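
To make the interleaving operation concrete, here is a minimal Python sketch of what interleaving two paired FASTQ files means. It is not SeqFu's implementation (SeqFu is written in Nim) and does not reproduce its command-line interface; the function names are hypothetical, and the sketch assumes strict four-line FASTQ records with no wrapped sequences.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes strict 4-line records (no multi-line sequences).
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if (len(record) != 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield record[0], record[1], record[3]


def interleave(forward_lines, reverse_lines):
    """Yield records alternating forward read, reverse read, forward read, ..."""
    fwd = read_fastq(forward_lines)
    rev = read_fastq(reverse_lines)
    for r1 in fwd:
        r2 = next(rev, None)
        if r2 is None:
            raise ValueError("reverse input has fewer records than forward input")
        for header, seq, qual in (r1, r2):
            yield f"{header}\n{seq}\n+\n{qual}\n"
    if next(rev, None) is not None:
        raise ValueError("forward input has fewer records than reverse input")


# Usage with in-memory lines; open file handles work the same way.
r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@read1/2\n", "TTGA\n", "+\n", "FFFF\n"]
print("".join(interleave(r1, r2)), end="")
```

The explicit record-count check reflects the kind of silent mismatch that ad hoc scripts tend to miss, which is the variability the abstract warns about.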

Keywords: FASTA; FASTQ; bioinformatics; next-generation sequencing; software.

 

2022

Necroptosis-driving genes RIPK1, RIPK3 and MLKL-p are associated with intratumoral CD3+ and CD8+ T cell density and predict prognosis in hepatocellular carcinoma

Nicolè L*, Sanavia T*, Cappellesso R, Maffeis V, Akiba J, Kawahara A, Naito Y, Radu CM, Simioni P, Serafin D, Cortese G, Guido M, Zanus G, Yano H, Fassina A             

Journal for ImmunoTherapy of Cancer, 2022

 

2021

An antisymmetric neural network to predict free energy changes in protein variants

Benevenuta S, Pancotti C, Fariselli P, Birolo G, Sanavia T            

Journal of Physics D: Applied Physics DOI: 10.1088/1361-6463/abedfb

Abstract

The prediction of free energy changes upon protein residue variations is an important application in biophysics and biomedicine. Several methods have been developed to address this problem so far, including physical-based and machine learning models. However, most of the current computational tools, especially data-driven approaches, fail to incorporate the antisymmetric basic thermodynamic principle: a variation from wild-type to a mutated form of the protein structure (X_W → X_M) and its reverse process (X_M → X_W) must have opposite values of the free energy difference: ΔΔG_WM = −ΔΔG_MW. Here, we build a deep neural network system that, by construction, satisfies the antisymmetric properties. We show that the new method (ACDC-NN) achieved comparable or better performance with respect to other state-of-the-art approaches on both direct and reverse variations, making this method suitable for scoring new protein variants preserving the antisymmetry. The code is available at: https://github.com/compbiomed-unito/acdc-nn.
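
The antisymmetry constraint can be illustrated with a generic construction: any pairwise scoring function f can be turned into an antisymmetric one by taking g(W, M) = (f(W, M) − f(M, W)) / 2. The sketch below shows only this general idea; it is not necessarily how ACDC-NN enforces the property inside its architecture, and the toy scorer and its numbers are hypothetical.

```python
def antisymmetrize(score):
    """Wrap a pairwise scorer so that g(a, b) == -g(b, a) by construction."""
    def antisymmetric_score(wild_type, mutant):
        return 0.5 * (score(wild_type, mutant) - score(mutant, wild_type))
    return antisymmetric_score


# Hypothetical per-residue values, used only to build a toy scorer.
TOY_VALUES = {"A": 1.8, "L": 3.8, "K": -3.9, "D": -3.5}


def toy_score(wild_type, mutant):
    """A deliberately non-antisymmetric stand-in for a trained model."""
    return 0.3 * TOY_VALUES[mutant] - 0.1 * TOY_VALUES[wild_type] + 0.2


ddg = antisymmetrize(toy_score)

direct = ddg("A", "L")    # variation W -> M
reverse = ddg("L", "A")   # reverse variation M -> W
print(direct, reverse, direct + reverse)
```

A side effect of the construction is that a variation of a residue to itself always scores zero, which is the physically expected value.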

 

2022

Context dependency of nucleotide probabilities and variants in human DNA

Liang Y, Grønbæk C, Fariselli P, Krogh A.              

BMC Genomics volume 23, Article number: 87 (2022)

Abstract

Background

Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent.

Results

Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that uses an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix.

Conclusions

Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.
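
One way to read the bidirectional Markov model described in the Results is sketched below: a single order-k Markov model predicts the base from the k bases to its left on the forward strand, and predicts the complementary base from the reverse complement of the k bases to its right, and the two probabilities are averaged. This is an interpretation of the abstract rather than the authors' code; the function names, the pseudocount smoothing, the choice to train on both strands, and the toy order of 2 (the paper reports contexts of up to 14) are all assumptions.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def train_markov(sequences, order):
    """Count which base follows each context of `order` preceding bases.

    Assumption: counts are collected from both strands of every sequence.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for strand in (seq, reverse_complement(seq)):
            for i in range(order, len(strand)):
                counts[strand[i - order:i]][strand[i]] += 1
    return counts


def markov_prob(counts, context, base, pseudocount=1.0):
    """Smoothed P(base | context); the pseudocount is an illustrative choice."""
    ctx = counts.get(context, Counter())
    return (ctx[base] + pseudocount) / (sum(ctx.values()) + 4 * pseudocount)


def bidirectional_prob(counts, seq, i, base, order):
    """Average the forward-strand and reverse-strand predictions at position i.

    Requires order <= i < len(seq) - order so both contexts are complete.
    """
    left = seq[i - order:i]
    right = seq[i + 1:i + 1 + order]
    forward = markov_prob(counts, left, base)
    # On the reverse strand the base is complemented and its preceding
    # context is the reverse complement of the right-hand neighbours.
    reverse = markov_prob(counts, reverse_complement(right),
                          base.translate(COMPLEMENT))
    return 0.5 * (forward + reverse)


# Toy usage on a short made-up sequence.
genome = ["ACGTACGTTAGCACGTACGATCGACGTACGT"]
model = train_markov(genome, order=2)
probs = {b: bidirectional_prob(model, genome[0], 10, b, order=2) for b in "ACGT"}
print(max(probs, key=probs.get), probs)
```

Because each of the two terms is a proper distribution over the four bases, their average is one as well, so the predicted base is simply the one with the highest averaged probability.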

 

2022

Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants

Capriotti E, Fariselli P.   

Hum Genet. 2022 Jan 31. doi: 10.1007/s00439-021-02419-4. Online ahead of print.

Abstract

Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. As expected, our method results in ~3% lower area under the receiver-operating characteristic curve (AUC) when compared with an ensemble-based algorithm (REVEL). Nevertheless, the combination of predictions of multiple methods can help to identify more reliable predictions. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
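
The encoding of conservation as wild-type/mutant residue frequencies can be sketched as follows. The example computes the two frequencies from one column of a protein multiple sequence alignment; it is an illustrative assumption about how such features might be derived, not the authors' pipeline, and the abstract does not specify the exact three features used. The function names and the tiny alignment are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, column):
    """Relative frequency of each residue in one alignment column, gaps ignored."""
    residues = [row[column] for row in alignment if row[column] not in "-."]
    counts = Counter(residues)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()} if total else {}


def variant_features(alignment, column, wild_type, mutant):
    """Return (wild-type frequency, mutant frequency) at the variant position."""
    freqs = column_frequencies(alignment, column)
    return freqs.get(wild_type, 0.0), freqs.get(mutant, 0.0)


# Toy alignment: four aligned sequences of equal length.
toy_alignment = ["MKVL", "MRVL", "MKIL", "MK-L"]
wt_freq, mut_freq = variant_features(toy_alignment, 1, "K", "R")
print(wt_freq, mut_freq)
```

A variant that replaces a highly conserved residue with one rarely or never seen at that position yields a high wild-type frequency and a low mutant frequency, which is the signal a downstream classifier would rank on.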

 
