Genomics (q-bio.GN)

Expected Density of Random Minimizers
Shay Golan, Arseny M. Shur
Oct 23 2024 math.CO q-bio.GN arXiv:2410.16968v1

@misc{2410.16968, author = {Shay Golan and Arseny M.~Shur}, title = {{E}xpected {D}ensity of {R}andom {M}inimizers}, year = {2024}, eprint = {2410.16968}, note = {arXiv:2410.16968v1} }
Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge2$ and a total order $\rho$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $\rho$. A key characteristic of the minimizer is the expected density of chosen $k$-mers among all $k$-mers in a random infinite $\sigma$-ary string. Random minimizers, in which the order $\rho$ is chosen uniformly at random, are often used in applications. However, little is known about their expected density $\mathcal{DR}_\sigma(k,w)$ besides the fact that it is close to $\frac{2}{w+1}$ unless $w\gg k$. We first show that $\mathcal{DR}_\sigma(k,w)$ can be computed in $O(k\sigma^{k+w})$ time. Then we attend to the case $w\le k$ and present a formula that allows one to compute $\mathcal{DR}_\sigma(k,w)$ in just $O(w^2)$ time. Further, we describe the behaviour of $\mathcal{DR}_\sigma(k,w)$ in this case, establishing the connection between $\mathcal{DR}_\sigma(k,w)$, $\mathcal{DR}_\sigma(k+1,w)$, and $\mathcal{DR}_\sigma(k,w+1)$. In particular, we show that $\mathcal{DR}_\sigma(k,w)<\frac{2}{w+1}$ (by a tiny margin) unless $w$ is small. We conclude with some partial results and conjectures for the case $w>k$.
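The sliding-window selection and the density described in this abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' method: the paper computes the expected density exactly, whereas the code below only estimates it on one finite random string. The function name `random_minimizer_density`, the parameters `n` and `seed`, the use of i.i.d. random floats to simulate a uniformly random order $\rho$, and the leftmost tie-breaking rule are all assumptions made for illustration.

```python
import random


def random_minimizer_density(sigma, k, w, n, seed=0):
    """Estimate the density of a random minimizer on one random string.

    sigma: alphabet size, k: k-mer length, w: number of k-mers per window
    (so each window spans w + k - 1 characters), n: length of the sampled
    string. All names here are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    # Finite random sigma-ary string standing in for the infinite one.
    text = [rng.randrange(sigma) for _ in range(n)]

    # Simulate a uniformly random order on k-mers by lazily giving each
    # distinct k-mer an independent random rank.
    rank = {}
    num_kmers = n - k + 1
    keys = []
    for i in range(num_kmers):
        kmer = tuple(text[i:i + k])
        if kmer not in rank:
            rank[kmer] = rng.random()
        keys.append(rank[kmer])

    # Slide a window over w consecutive k-mers and record the position of
    # the minimal one; equal k-mers are resolved to the leftmost position
    # (an assumed tie-breaking rule).
    chosen = set()
    for start in range(num_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (keys[i], i))
        chosen.add(best)

    # Density: fraction of k-mer positions selected at least once.
    return len(chosen) / num_kmers


if __name__ == "__main__":
    k, w = 8, 8
    estimate = random_minimizer_density(sigma=4, k=k, w=w, n=20000)
    print(f"estimated density: {estimate:.4f}")
    print(f"2/(w+1):           {2 / (w + 1):.4f}")
```

For $w\le k$ the estimate should land near $\frac{2}{w+1}$, in line with the abstract; a single finite sample is far too noisy to resolve the tiny margin by which the true expected density falls below that value.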
DNAHLM -- DNA sequence and Human Language mixed large language Model
Wang Liang
Oct 23 2024 q-bio.GN cs.LG arXiv:2410.16917v1

@misc{2410.16917, author = {Wang Liang}, title = {{DNAHLM} -- {DNA} sequence and {H}uman {L}anguage mixed large language {M}odel}, year = {2024}, eprint = {2410.16917}, note = {arXiv:2410.16917v1} }
Many DNA large language models already exist, but most of them still follow traditional uses, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and human natural language models are entirely separate, while techniques like prompt engineering require natural language; this significantly limits the application of DNA large language models. This paper introduces a hybrid model trained on the GPT-2 network, combining DNA sequences and English text to explore the potential of using prompts and fine-tuning in DNA models. The model has demonstrated its effectiveness in DNA-related zero-shot prediction and multitask applications.
Hierarchical Classification for Predicting Metastasis Using Elastic-Net Regularization on Gene Expression Data
Alex Chu, Benjamin Osafo Agyare, Blessing Oloyede
Oct 23 2024 q-bio.GN arXiv:2410.16741v1

@misc{2410.16741, author = {Alex Chu and Benjamin Osafo Agyare and Blessing Oloyede}, title = {{H}ierarchical {C}lassification for {P}redicting {M}etastasis {U}sing {E}lastic-{N}et {R}egularization on {G}ene {E}xpression {D}ata}, year = {2024}, eprint = {2410.16741}, note = {arXiv:2410.16741v1} }
Metastasis is a leading cause of cancer-related mortality and remains challenging to detect during early stages. Accurate identification of cancers likely to metastasize can improve treatment strategies and patient outcomes. This study leverages publicly available gene expression profiles from primary cancers, with and without distal metastasis, to build predictive models. We utilize elastic net regularization within a hierarchical classification framework to predict both the tissue of origin and the metastasis status of primary tumors. Our elastic net-based hierarchical classification achieved a tissue-of-origin prediction accuracy of 97%, and a metastasis prediction accuracy of 90%. Notably, mitochondrial gene expression exhibited significant negative correlations with metastasis, providing potential biological insights into the underlying mechanisms of cancer progression.
AskBeacon -- Performing genomic data exchange and analytics with natural language
Anuradha Wickramarachchi, Shakila Tonni, Sonali Majumdar, Sarvnaz Karimi, Sulev Kõks, Brendan Hosking, Jordi Rambla, Natalie A. Twine, Yatish Jain, Denis C. Bauer
Oct 23 2024 cs.AI cs.CY q-bio.GN arXiv:2410.16700v1

@misc{2410.16700, author = {Anuradha Wickramarachchi and Shakila Tonni and Sonali Majumdar and Sarvnaz Karimi and Sulev Kõks and Brendan Hosking and Jordi Rambla and Natalie A.~Twine and Yatish Jain and Denis C.~Bauer}, title = {{A}sk{B}eacon -- {P}erforming genomic data exchange and analytics with natural language}, year = {2024}, eprint = {2410.16700}, note = {arXiv:2410.16700v1} }
Enabling clinicians and researchers to directly interact with global genomic data resources by removing technological barriers is vital for medical genomics. AskBeacon enables Large Language Models to be applied to securely shared cohorts via the GA4GH Beacon protocol. By simply "asking" Beacon, actionable insights can be gained, analyzed and made publication-ready.