Genetics discovery powered by massive multi-modal and multi-scale functional genomics knowledge graph

Preprint | Website | Talk at Stanford Graph Learning Workshop

Genome-wide association studies (GWASs) have identified tens of thousands of disease-associated variants and provided critical insights into developing effective treatments. However, limited sample sizes have hindered the discovery of variants for less common and rare diseases. Here, we introduce KGWAS, a novel geometric deep learning method that leverages a massive functional knowledge graph across variants and genes to significantly improve detection power in small-cohort GWASs.

Installation

Install PyTorch Geometric by following its installation instructions, then run:

pip install KGWAS

Core KGWAS API Usage

from kgwas import KGWAS, KGWAS_Data
data = KGWAS_Data(data_path = './data') ## initialize KGWAS data class with data path

data.load_kg() ## load the knowledge graph
data.load_external_gwas(PATH) ## load the GWAS file
data.process_gwas_file() ## process the GWAS file
data.prepare_split() ## prepare the train/val/test split

run = KGWAS(data, device = 'cuda:0', seed = 1) ## initialize KGWAS model
run.initialize_model()

run.train(epoch = 10) ## train the model

Data download

To get you started quickly, KGWAS defaults to a fast mode that uses Enformer embeddings for variant features and ESM embeddings for gene features (instead of baselineLD for variants and PoPS for genes, which are large files). In fast mode you do not need to download any data yourself; the KGWAS API automatically downloads the relevant files. This mode can be used to apply KGWAS to your own GWAS sumstats.

If you want to (1) use the full mode of KGWAS (i.e. larger node embeddings) or (2) access the null/causal simulations or (3) access the 21 subsampled GWAS sumstats across various sample sizes or (4) analyze the KGWAS sumstats for subsampled data or (5) analyze the KGWAS sumstats for all UKBB ICD10 diseases, please use this link. Note that this file is large (around 55GB) and may take a while to download.
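Before starting a download of this size, it can help to confirm that the target disk has enough room. The snippet below is a minimal sketch using only the Python standard library; the function name and the `headroom` factor (budgeting for the archive plus its unzipped copy) are illustrative assumptions, not part of KGWAS.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 55  # approximate archive size quoted in this README


def has_room_for_download(target_dir=".", required_gb=REQUIRED_GB, headroom=2.0):
    """Return True if the filesystem holding target_dir has enough free space.

    headroom=2.0 is an assumption: it budgets for the archive and its
    unzipped copy sitting side by side until the archive is deleted.
    """
    free_bytes = shutil.disk_usage(Path(target_dir)).free
    return free_bytes >= required_gb * headroom * 1024**3


if __name__ == "__main__":
    if has_room_for_download("."):
        print("Enough free space for the full KGWAS data.")
    else:
        print("Not enough free space; pick another directory.")
```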

Tutorial

| Notebook | Description |
| --- | --- |
| Introduction | Tutorial on the key KGWAS API and functionalities, including applying KGWAS to your own sumstats. |
| Simulation analysis | Tutorial on the simulation analysis. |
| Subsampling analysis | Tutorial on the subsampling analysis. |
| Disease critical network | Tutorial on generating the disease critical network. |
| MAGMA analysis | Tutorial on generating gene-level association scores. |

Extended API Usage

`KGWAS_Data` class

data = KGWAS_Data(data_path = './data')

data_path: specify the path to the data folder. If not specified, the default path is ./data. If you use the full mode, unzip the data and use the path to the unzipped folder.

data.load_kg(snp_init_emb = 'enformer', go_init_emb = 'random', gene_init_emb = 'esm', sample_edges = False, sample_ratio = 1): load KGWAS knowledge graph and node embeddings

snp_init_emb: specify the variant embedding method. Options are enformer (default), baselineLD, SLDSC, cadd, kg, random
go_init_emb: specify the gene ontology embedding method. Options are random (default), biogpt, kg
gene_init_emb: specify the gene embedding method. Options are esm (default), pops_expression, pops, kg, random
sample_edges: whether to sample edges from the knowledge graph. Default is False
sample_ratio: the ratio of edges to sample. Default is 1
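The options above can be checked before the (potentially slow) graph load. The following is a sketch, not part of the KGWAS package: `load_kg_kwargs` and the preset names are hypothetical helpers, and the "full" preset assumes that the baselineLD/PoPS embeddings mentioned under Data download correspond to the option strings `baselineLD` and `pops`.

```python
# Allowed values copied from the option lists documented above.
LOAD_KG_OPTIONS = {
    "snp_init_emb": {"enformer", "baselineLD", "SLDSC", "cadd", "kg", "random"},
    "go_init_emb": {"random", "biogpt", "kg"},
    "gene_init_emb": {"esm", "pops_expression", "pops", "kg", "random"},
}

# Assumption: "full" mode maps to the baselineLD / pops option strings.
MODE_PRESETS = {
    "fast": {"snp_init_emb": "enformer", "gene_init_emb": "esm"},
    "full": {"snp_init_emb": "baselineLD", "gene_init_emb": "pops"},
}


def load_kg_kwargs(mode="fast", **overrides):
    """Build keyword arguments for data.load_kg and reject unknown option values."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_PRESETS)}")
    kwargs = {**MODE_PRESETS[mode], **overrides}
    for name, value in kwargs.items():
        allowed = LOAD_KG_OPTIONS.get(name)
        # Arguments without a documented option list (e.g. sample_ratio) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return kwargs


# Intended use (requires the full data download for the "full" preset):
#   data.load_kg(**load_kg_kwargs("full", go_init_emb="biogpt"))
```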

data.load_external_gwas(path, seed = 42): load external/your own GWAS file

path: specify the path to the GWAS file. The expected columns are CHR, SNP, P, and N; SNP values should be rs IDs.
seed: specify the seed for the data split. Default is 42
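A quick pre-flight check of your own sumstats file can catch format problems before loading. This is a hedged sketch using only the standard library: the helper name `check_sumstats` is hypothetical, the tab delimiter is an assumption (this README does not state the file format), and the rs ID pattern `rs` followed by digits is the conventional dbSNP form.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("CHR", "SNP", "P", "N")
RS_ID = re.compile(r"^rs\d+$")


def check_sumstats(handle, delimiter="\t"):
    """Return a list of human-readable problems found in a sumstats file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if not RS_ID.match(row["SNP"] or ""):
            problems.append(f"line {line_no}: SNP {row['SNP']!r} is not an rs ID")
        try:
            p_value = float(row["P"])
        except (TypeError, ValueError):
            p_value = float("nan")
        if not 0.0 <= p_value <= 1.0:  # NaN also fails this check
            problems.append(f"line {line_no}: P {row['P']!r} is not in [0, 1]")
    return problems


if __name__ == "__main__":
    demo = io.StringIO("CHR\tSNP\tP\tN\n1\trs123\t0.05\t5000\n1\t1:12345\t0.2\t5000\n")
    for problem in check_sumstats(demo):
        print(problem)
```

For a file on disk, pass an open handle, for example `with open(path, newline="") as fh: check_sumstats(fh)`.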

data.load_full_gwas(pheno, seed): load full-cohort GWAS files already run in KGWAS. Note that this requires full data download.

pheno: specify the phenotype to load. Use data.get_pheno_list() to see all available phenotypes.

data.load_gwas_subsample(pheno, sample_size, seed): load subsampled GWAS files already run in KGWAS. Note that this requires full data download.

pheno: specify the phenotype to load. Use data.get_pheno_list()["21_indep_traits"] to see all available phenotypes.
sample_size: specify the sample size to load. Available values are 1000, 2500, 5000, 7500, 10000, 50000, 100000, and 200000.
seed: specify the seed for the data split. Available values are 1, 2, 3, 4, and 5.
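To sweep every subsampled GWAS of one phenotype, the documented sample sizes and seeds can be enumerated as a grid. The sketch below only builds the argument combinations; `subsample_grid` is a hypothetical helper, and the phenotype name in the usage comment is a placeholder you would replace with an entry from `data.get_pheno_list()["21_indep_traits"]`.

```python
from itertools import product

# Values copied from the parameter descriptions above.
SAMPLE_SIZES = (1000, 2500, 5000, 7500, 10000, 50000, 100000, 200000)
SEEDS = (1, 2, 3, 4, 5)


def subsample_grid(pheno, sample_sizes=SAMPLE_SIZES, seeds=SEEDS):
    """Yield keyword arguments for every (sample_size, seed) pair of one phenotype."""
    for sample_size, seed in product(sample_sizes, seeds):
        yield {"pheno": pheno, "sample_size": sample_size, "seed": seed}


# Intended use (requires the full data download):
#   for kwargs in subsample_grid("<phenotype>"):
#       data.load_gwas_subsample(**kwargs)
```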

data.load_simulation_gwas(simulation_type, seed): load the null and causal simulation data

simulation_type: specify the simulation type. Options are null and causal.
seed: specify the seed for the data split. Valid values range from 1 to 500.

data.process_gwas_file(): process the GWAS file for training

data.prepare_split(test_set_fraction_data = 0.05): prepare the train/val/test split

test_set_fraction_data: specify the fraction of data to use as the test set. Default is 0.05
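To make the meaning of `test_set_fraction_data` concrete, here is a generic seeded hold-out split. It is an illustration only: this README does not describe how KGWAS splits data internally (including how the validation set is formed), so `holdout_split` is a hypothetical stand-in rather than the package's implementation.

```python
import random


def holdout_split(items, test_set_fraction_data=0.05, seed=42):
    """Shuffle items with a fixed seed and hold out a fraction as the test set."""
    rng = random.Random(seed)  # a private generator keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_set_fraction_data)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = holdout_split(range(1000))
    print(len(train), len(test))
```

With 1000 items and the default fraction of 0.05, 50 items are held out and 950 remain for training and validation.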

`KGWAS` class

run = KGWAS(data, weight_bias_track = False, device = 'cuda', proj_name = 'KGWAS', exp_name = 'KGWAS', seed = 42): initialize KGWAS model

data: specify the KGWAS data class
weight_bias_track: whether to log the run to Weights & Biases (wandb) during training. Default is False
device: specify the device to run the model. Default is cuda
proj_name: specify the project name. Default is KGWAS
exp_name: specify the experiment name. Default is KGWAS
seed: specify the seed for the model. Default is 42

run.initialize_model(gnn_num_layers = 2, gnn_hidden_dim = 128, gnn_backbone = 'GAT', gnn_aggr = 'sum', gat_num_head = 1): initialize the KGWAS model

gnn_num_layers: specify the number of GNN layers. Default is 2
gnn_hidden_dim: specify the hidden dimension of the GNN. Default is 128
gnn_backbone: specify the GNN backbone. Options are GAT (default), GCN, SAGE, SGC
gnn_aggr: specify the GNN aggregation method. Options are sum (default), mean, min, max, cat
gat_num_head: specify the number of GAT heads. Default is 1

run.load_pretrained(path): load pretrained model

path: specify the path to the pretrained model

run.train(batch_size = 512, num_workers = 6, lr = 1e-4, weight_decay = 5e-4, epoch = 10, save_best_model = False, save_name = None, data_to_cuda = False): train the model

batch_size: specify the batch size. Default is 512. If you hit a CUDA out-of-memory (OOM) error, reduce the batch size.
num_workers: specify the number of workers for data loading. Default is 6
lr: specify the learning rate. Default is 1e-4
weight_decay: specify the weight decay. Default is 5e-4
epoch: specify the number of epochs. Default is 10
save_best_model: whether to save the best model. Default is False
save_name: specify the name to save the model. Default is run.exp_name
data_to_cuda: whether to move the data to CUDA. Default is False. Setting it to True speeds up training at the cost of slightly more CUDA memory.
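The advice to reduce the batch size on a CUDA OOM error can be automated with a small retry wrapper. This is a sketch under stated assumptions: `train_with_backoff` is a hypothetical helper, it relies on PyTorch reporting OOM as a `RuntimeError` whose message contains "out of memory", and it assumes that calling `run.train` again after a failed attempt is safe.

```python
def train_with_backoff(train_fn, batch_size=512, min_batch_size=32, **train_kwargs):
    """Call train_fn, halving batch_size each time it fails with an OOM error.

    Returns the batch size that finally succeeded. Errors that are not
    out-of-memory errors, or OOM at the smallest allowed size, are re-raised.
    """
    while True:
        try:
            train_fn(batch_size=batch_size, **train_kwargs)
            return batch_size
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2  # retry with half the batch size


# Intended use with an initialized KGWAS run:
#   used = train_with_backoff(run.train, epoch=10)
```

Halving keeps the number of retries small (512 to 32 takes at most four steps); a lower `min_batch_size` trades training speed for a better chance of fitting in memory.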

Cite Us

@article{kgwas,
  title={Small-cohort GWAS discovery with AI over massive functional genomics knowledge graph},
  author={Huang, Kexin and Zeng, Tony and Koc, Soner and Pettet, Alexandra and Zhou, Jingtian and Jain, Mika and Sun, Dongbo and Ruiz, Camilo and Ren, Hongyu and Howe, Laurence J and others},
  journal={medRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory Press}
}
