Name of SIG: SIG on Big Data for Healthcare, Medicine and Biology
Chair: Alex Kuo, University of Victoria, Canada. Email:

Co-Chair: Tony Sahama, Queensland University of Technology, Australia
Co-Chair: Robert Hsu, Chung Hua University, Taiwan
Advisor: Fernando Martin-Sanchez, University of Melbourne, Australia/Weill Cornell Medicine, Cornell University, USA

Advisor: Philip S. Yu, University of Illinois at Chicago, USA

Advisor: Riccardo Bellazzi, University of Pavia, Italy

LinkedIn discussion site: TBD

Web site: TBD

Scope and Objectives

1. Background

Healthcare and life science is among the most data-intensive industries in the world. Modern clinical information systems, such as Electronic Health Records (EHRs), Computerized Physician Order Entry (CPOE), Laboratory Information Systems, Picture Archiving and Communication Systems (PACS), and medical sensors, can generate enormous volumes of patient data every year. A published study estimated worldwide digital healthcare data at between 1.2 and 2.4 exabytes (10^18 bytes) a year, and expected it to reach 25 exabytes in 2020. These vast amounts of digital health data are called “Big Data” for healthcare.

Big Data in healthcare/life science differs from that in other disciplines, such as social network or transactional business data, in that it includes standardized structured, coded data (e.g. ICD, SNOMED CT), semi-structured data (e.g. HL7 messages), unstructured clinical notes, medical images (e.g. MRI, X-rays), genetic lab data, and other types of data (e.g. public health and mental or behavioral health). Huge volumes of very heterogeneous raw data are generated daily by a variety of hospital systems such as EHRs, CPOE, PACS, Clinical Decision Support Systems, and Laboratory Information Systems, which are used in many healthcare settings such as physician offices and hospitals. Several published studies have asserted that Big Data, managed efficiently, can improve care delivery while reducing healthcare costs. A McKinsey Global Institute study suggests, “If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year”. A number of published articles have also reported using Big Data to improve population health through better policy decision making.

2. Big Data Analytics Challenges

Extracting useful knowledge from Big Data can be viewed as a processing pipeline made up of multiple distinct stages, all of which must work well for the data to be fully utilized. Each stage faces several specific challenges, as follows:

  • Data aggregation: Currently, the most commonly used method to aggregate large quantities of data is to copy the data to a large storage drive and physically move the drive to the configured destination(s). Big Data research projects usually involve multiple organizations, different geographic locations and large numbers of researchers, so data exchange between groups is very difficult with this method. Another way is to transfer the data over a network. However, transferring vast amounts of data into or out of a data repository (e.g. data warehouse/Cloud Computing) poses a significant networking challenge.
  • Data maintenance: Since Big Data involves large collections of datasets, it is very difficult to efficiently store and maintain the data on a single hard drive using traditional data management systems such as relational databases. It is also a heavy IT burden (cost and time) for small organizations or labs to manage.
  • Data integration: This involves integrating and transforming data into an appropriate format for subsequent data analysis. However, Big Data in healthcare are extremely large, distributed, unstructured and heterogeneous, making integration and transformation all the more problematic. Integrating unstructured data is a major challenge for Big Data Analytics (BDA), and even the integration of structured EHR data raises many issues.
  • Data analysis:

Complexity of the analysis – For some analysis algorithms, the computing time increases dramatically even with small amounts of data growth. For example, the Bayesian Network is a popular model for representing knowledge in computational biology and bioinformatics. However, because of its computational complexity, the computing time for finding the best network structure increases exponentially as the number of variables in the data rises.
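
One way to get a feel for this blow-up is to count how many candidate network structures an exhaustive search would have to consider. The short sketch below is an illustration only, not a method proposed by the SIG: it uses Robinson's recurrence to count the labeled directed acyclic graphs on n nodes, where the function name `count_dags` is a hypothetical helper introduced here.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Count labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1  # the empty graph
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} variables -> {count_dags(n):,} candidate structures")
```

Three variables already allow 25 structures and five allow 29,281, which is why practical structure-learning tools typically rely on heuristics or constraints rather than exhaustive search.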

Scale of the data – Even simple data analysis could take several days, or even months, to produce a result when the data is very large (e.g. zettabyte (10^21 bytes) scale).

Parallelization of computing model – For computationally intense problems, we can parallelize the analysis so that the problem is solved by distributing tasks over many computers. However, if the analytic algorithm cannot be parallelized, it will be very difficult for massively parallel processing (MPP) tools to perform an efficient computation.
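
The contrast can be sketched in a few lines. The example below is a toy illustration under stated assumptions: all function names are hypothetical, the input list is assumed to be non-empty, and a thread pool stands in for a real cluster only to show the split/compute/combine structure (it is not meant to demonstrate a speed-up). A mean splits cleanly into independent per-chunk work, whereas the second function is a step-by-step update in which every step needs the previous result, so this simple chunking pattern does not apply to it.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(values, n_chunks):
    """Split a non-empty list into at most n_chunks contiguous pieces."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(part):
    """Independent per-chunk work: return (count, sum) for one piece."""
    return len(part), sum(part)


def parallel_mean(values, n_workers=4):
    """Mean computed by combining independent per-chunk partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunk(values, n_workers)))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


def sequential_chain(values, modulus=1_000_003):
    """Toy nonlinear recurrence: each step depends on the previous state."""
    state = values[0]
    for v in values[1:]:
        state = (state * state + v) % modulus
    return state


if __name__ == "__main__":
    data = list(range(1, 101))
    print(parallel_mean(data))
    print(sequential_chain(data))
```

In a real deployment the thread pool would be replaced by a process pool or a distributed framework; the point here is only that the first computation has independent pieces to hand out and the second does not.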

  • Pattern interpretation/application:

Legality and ethics – In healthcare, the need to ensure patient data security, confidentiality and privacy, as mandated by privacy commissioners on behalf of the general public, can be a major barrier to health Big Data Analytics. With large datasets, much of the value is unlocked by making information transparent, and that same transparency makes it all too easy to expose individuals. Thus, our ability to protect individual privacy in the era of Big Data is limited.

Data validation – Many people instinctively believe that bigger data always provides better information for decision making. Unfortunately, Big Data tools for agile data science do not protect us from inaccuracies, missing information and faulty assumptions. We can easily be fooled into treating a discovered correlation as meaningful when the actual cause of the trend is hidden in the nuances of the data and its structure.

Knowledge representation – The ability to analyze Big Data is of limited value if decision makers cannot understand the discovered patterns. Unfortunately, due to the complex nature of analytics in healthcare, presenting and visualizing the results, and having non-technical domain experts interpret them, remain major challenges.
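
The data validation concern, that a trend seen in pooled data can hide its real cause in the structure of the data, can be made concrete with a small sketch. The counts below are entirely hypothetical and chosen only for illustration; the names `counts`, `pooled_rate` and `stratum_rates` are likewise introduced here. Treatment B looks better when all patients are pooled, yet treatment A does better within each severity group, because A was given mostly to severe cases.

```python
# Hypothetical (successes, patients) per treatment, split by case severity.
counts = {
    "A": {"mild": (90, 100), "severe": (210, 300)},
    "B": {"mild": (255, 300), "severe": (65, 100)},
}


def pooled_rate(groups):
    """Success rate when the severity groups are lumped together."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients


def stratum_rates(groups):
    """Success rate within each severity group."""
    return {name: s / n for name, (s, n) in groups.items()}


if __name__ == "__main__":
    for treatment, groups in counts.items():
        print(treatment, "pooled:", round(pooled_rate(groups), 3),
              "by severity:", stratum_rates(groups))
```

With these made-up numbers the pooled rates are 0.75 for A and 0.80 for B, while within the groups A leads 0.90 to 0.85 (mild) and 0.70 to 0.65 (severe). More data would not remove this reversal; only attention to how the data were generated does.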

3. Objectives and Research Topics/Areas

The main objective of the SIG is to propose solutions for technical and non-technical challenges, and to explore previously unknown challenges and resolutions in the Big Data Analytics process for Healthcare, Medicine, and Biology. The research topics/areas of interest at this SIG include, but are not limited to:

  • Healthcare Big Data Analytics (Architecture, Methodologies and Tools)
  • Big Data for Precision/Customized Medicine
  • Health Big Data Privacy/Security
  • Health Big Data Management/Repositories
  • Standard Development for Healthcare Big Data Governance/Interoperability
  • Metadata for Healthcare Big Data Integration, Discovery and Interpretation
  • Machine Learning/Deep Learning for Big Data Analytics in Biology, Medicine, and Healthcare
  • Big Data Analytics Infrastructure for Biology, Medicine, and Healthcare
  • Visualization Analytics for Big Data in Biology, Medicine, and Healthcare
  • Real World Big Data Analytics Case Study in Biology, Medicine, and Healthcare
  • Beyond genomics: the next big innovations on the frontiers of science, technology and medicine
  • Data-driven decision making and prescriptive analytics
  • Big data in bioinformatics
  • Medical (Big) data mining
  • More Topics…

Founding Members

Abdul Roudsari, University of Victoria, Canada
Ajinkya Prabhune, Karlsruhe Institute of Technology, Germany
Alex Kuo, University of Victoria, Canada
Alex Thomo, University of Victoria, Canada
Andre Kushniruk, University of Victoria, Canada
Elham Sedghi, University of Victoria, Canada
Elizabeth Borycki, University of Victoria, Canada
Fernando Martin-Sanchez, University of Melbourne, Australia/Weill Cornell Medicine, Cornell University, USA
Giancarlo Fortino, Università della Calabria, Italy
Huimin Lu, Kyushu Institute of Technology, Japan
Hung-Ming Chen, National Taichung University of Science and Technology, Taiwan
Jie Li, University of Tsukuba, Japan
Jinsong Wu, Universidad de Chile, Chile
Kai Wong, University of Alberta, Canada
Omid Shabestari, Carilion Clinic, USA
Periklis Chatzimisios, Department of Informatics, Alexander TEI of Thessaloniki, Greece
Philip S. Yu, University of Illinois at Chicago, USA
Riccardo Bellazzi, University of Pavia, Italy
Shu-Lin Wang, National Taichung University of Science and Technology, Taiwan
Shu-Sheng Liaw, China Medical University, Taiwan
Song Guo, Hong Kong Polytechnic University, Hong Kong
Tony Sahama, Queensland University of Technology, Australia
Wei Hu, Nanjing University, China
Wei Xing, Francis Crick Institute, UK
Wo L Chang, National Institute of Standards and Technology, USA
Yin Zhang, Zhongnan University of Economics and Law, China
Yinglong Xia, HUAWEI, USA
Yonghong Peng, University of Sunderland, UK
Yudong Zhang, Nanjing Normal University, China