Biobank-scale whole-genome sequencing (WGS) studies are increasingly pivotal in unraveling the genetic bases of diverse health outcomes. However, the sheer volume and complexity of these datasets make them difficult to manage and analyze. We propose vcf2agds, an all-in-one toolkit that efficiently converts WGS data from Variant Call Format (VCF) to the annotated Genomic Data Structure (aGDS) format, significantly reducing data size while supporting seamless integration of genomic and functional data for comprehensive genetic analyses. Additionally, STAARpipeline, equipped with the aGDS files, enables scalable, comprehensive, and functionally informed WGS analysis, facilitating the detection of common and rare coding and noncoding phenotype-genotype associations. We applied STAARpipeline to analyze Alzheimer's disease (AD) in 459,216 samples from the UK Biobank. All analyses scale well in computation time and memory. We identify several potentially novel significant associations with AD. As WGS datasets continue to expand in size and complexity, our proposed tools will be increasingly useful for unlocking the full potential of genomic research.
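Speaker bio: Zilin Li (李子林) is a professor at Northeast Normal University and was selected for the Young Scholars track of the national-level overseas high-level talent program. Li previously served as an assistant professor in the Department of Biostatistics and Health Data Science at the Indiana University School of Medicine, and as a postdoctoral fellow, research associate, and research scientist in the Department of Biostatistics at Harvard University. Li received bachelor's and doctoral degrees from the Department of Mathematical Sciences at Tsinghua University and studied under Xihong Lin, a member of both the U.S. National Academy of Sciences and the National Academy of Medicine. In 2023, Li was named an Elected Member of the International Statistical Institute and received the "Most Promising" award of the Alibaba DAMO Academy Young Fellow (Qingcheng) Award. Li's main research interests are statistical methods and theory for high-dimensional data, and statistical genetics. Related work has been published, with Li as first or corresponding author, in international journals including Nature Computational Science, Nature Methods, Nature Genetics, and JASA. Li leads a General Program project of the National Natural Science Foundation of China and serves as vice chair of the Large Population Health and Common Disease Genetics Branch of the Genetics Society of China.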