Abstract
Motivation: To understand the molecular mechanisms involved in cancer development, significant efforts are being invested in cancer research. This has resulted in millions of scientific articles. An efficient and thorough review of the existing literature is crucially important to drive new research.
This time-consuming task can be supported by emerging computational approaches based on text mining, which make it possible to organise and retrieve the desired information efficiently from sizable databases. One way to organise existing knowledge on cancer is to utilise the widely accepted framework of the Hallmarks of Cancer. These hallmarks refer to the alterations in cell behaviour that characterise the cancer cell.
Results: We created an extensive Hallmarks of Cancer taxonomy and developed an automatic text mining methodology and a tool (CHAT) capable of retrieving and organising millions of cancer-related references from PubMed into the taxonomy. The efficiency and accuracy of the tool were evaluated intrinsically as well as extrinsically by case studies. The correlations identified by the tool show that it offers great potential to organise and correctly classify cancer-related literature. Furthermore, the tool can be useful, for example, in identifying hallmarks associated with extrinsic factors, biomarkers and therapeutic targets.