Genome-scale meta-analysis of breast cancer datasets identifies promising targets for drug development

Because of the highly heterogeneous nature of breast cancer, each subtype differs in response to several treatment regimens. This has limited the therapeutic options for metastatic breast cancer disease requiring exploration of diverse therapeutic models to target tumor specific biomarkers. Differentially expressed breast cancer genes identified through extensive data mapping were studied for their interaction with other target proteins involved in breast cancer progression. The molecular mechanisms by which these signature genes are involved in breast cancer metastasis were also studied through pathway analysis. The potential drug targets for these genes were also identified. From 50 DEGs, 20 genes were identified based on fold change and p-value and the data curation of these genes helped in shortlisting 8 potential gene signatures that can be used as potential candidates for breast cancer. Their network and pathway analysis clarified the role of these genes in breast cancer and their interaction with other signaling pathways involved in the progression of disease metastasis. The miRNA targets identified through miRDB predictor provided potential miRNA targets for these genes that can be involved in breast cancer progression. Several FDA approved drug targets were identified for the signature genes easing the therapeutic options for breast cancer treatment. The study provides a more clarified role of signature genes, their interaction with other genes as well as signaling pathways. The miRNA prediction and the potential drugs identified will aid in assessing the role of these targets in breast cancer.


Background
Cancer is one of the leading causes of death for the past several years and is the second cause of mortality according to the American Cancer Society (ACS) statistics after cardiovascular, infectious and parasitic disorders. Breast cancer is one of the most commonly diagnosed life-threatening malignancy that remains to be the leading cause of cancer incidence and mortality in women globally [1].
Several factors have been attributed towards the development of breast carcinoma. These include age, personal history of breast cancer, reproductive, environmental and genetic factors. Increasing age enhances the risk of breast cancer development [2]. Having a personal history of breast cancer also contributes towards a greater risk of second breast cancer that can be ipsilateral or contralateral. Family history of breast cancer can also enhance the risk of development of cancer in women. About 5-10% of women with breast cancer show an autosomal dominant inheritance while 20-25% have a positive family history [3]. Genetic predisposition alleles showing 40-85% of lifetime threat of breast cancer development include BRCA1 and BRCA2 mutations, TP53 mutations, PTEN, STK11, E-cadherin and neurofibromatosis (NF) [4].
The treatment strategies for breast cancer are largely determined by the status of progesterone receptor, estrogen receptor and the human epidermal growth factor receptor 2. Clinicopathological factors such as tumor grade, size and status of lymph node also determine the therapeutic plan, however, the biomarkers for the tumor invasion and metastasis are of profound importance in order to formulate new markers and treatment strategies for breast carcinomas. This will aid in both current therapies and tumor prognosis [5].
With the aid of in silico bioinformatic approaches the attainment of new treatment strategies have become easier. One such approach that has helped in identifying new markers in cancer therapy is the cDNA differential analysis [6]. In this study, 24 datasets were downloaded to analyze gene expression profiles in breast cancer and a functional analysis was performed to identify the differentially expressed genes (DEGs) between breast tumor cells and treated tissues. A genetic network was constructed as well as pathway analysis and miRNA target identification were performed to understand the underlying molecular mechanisms and to identify potential therapeutic targets for breast cancer. Moreover, drug-gene network analysis has also been performed to identify potential drug targets for breast cancer.

Accession of gene expression data
The study focuses on the identification of potential breast cancer targets through a differential screening method. The datasets of breast cancer were accessed from Gene Expression Omnibus database. The screening criteria was "organism: Homo sapiens", and "experiment type: expression profiling by array". The Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (CDF: Hs133P_ Hs_ENST, version 10) (Affymetrix, Inc., Santa Clara, CA, 95051, USA) platform was used. All datasets comprised of GEO accession number, platform, sample type, number of samples and gene expression data. The array platform and hgu133plus2 annotation platform of probes were used to identify the differentially expressed genes. The software R and Bioconductor packages AffyQCReport, Affy, Annotate, AnnotationDbi, Limma, Biobase, AffyRNADegradation, hgu133plus2cdf, and hgu133a2cdf were used to perform the computational analysis [7].

Preprocessing and differential expression analysis of microarray datasets
The preprocessing of datasets was performed by preparing the phenodata files for each dataset in a recognizable format [8]. Using the R version 3.1.3, the Bioconductor ArrayQuality Metrics package was utilized for the normalization of the data to a median expression level for each gene [7]. After normalization, the background correction was done for perfect match (pm) and mismatch (mm) by Robust Multi-array Analysis (RMA). The method was used to eliminate the artifacts and local noise. The expression value with a p-value < 0.15 was measured as marginal log transformation. Afterwards, summarization was performed by RMA-algorithm in order to measure the averages between probes in a probe set to attain the summary of intensities.
The quality of RNA in these microarray datasets was measured using the AffyRNADegradation package of Bioconductor, also called degradation analysis [9]. Lastly, the DEGs in each dataset were identified by pairwise comparison and the Benjamini-Hochberg method [10] was employed for multiple testing correction. The differentially expressed genes were shortlisted and ranked according to their p-values and resulting scores. The cutoff values set were p-value ≤ 0.05, FDR < 0.05 (False Discovery Rate) and absolute log fold change logFC > 1 [11] to calculate the moderated statistics.

Data curation and cluster analysis
The shortlisted genes obtained through differential expression analysis were further screened to confirm their role in breast cancer using diverse data sources such as PubMed (http://www.ncbi.nlm.nih.gov/pubme d), MeSH (http://www.ncbi.nlm.nih.gov/mesh), OMIM (Online Mendelian Inheritance in Man) (http://www. ncbi.nlm.nih.gov/omim), and PMC database (http:// www.ncbi.nlm.nih.gov/pmc) [12]. Biomedical text mining helped in filtering significant disease specific genes. The CIMminner tool was used to perform the cluster analysis based on the expression values in each dataset using the Absolute Pearson correlation analysis. The cluster analysis revealed variations in gene expression levels between control and treated replicates [13].

Network analysis and identification of gene signatures
The protein-protein interaction network helped in identifying the interaction of each protein with other genes having different biological or molecular functions in a diseased state as compared to normal. The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [14] and Human Annotated and Predicted Protein Interaction (HAPPI) databases [15] were used to evaluate the  28:5 proteins that interacted with each other in breast cancer with a confidence score of 0.999. The visualization and analysis of molecular interactions of seeder genes with the target genes were done using Cytoscape (version 3.2.1, Temple Place, Suite 330, Boston, MA 02111-1307 USA) software. The role of target genes in breast cancer was mapped by OMIM, MeSH, and PMC databases to identify the breast cancer associated gene signatures whose dysregulation causes a pathological phenotype. A molecular sub-network of those genes that were associated with pathways of interest causing breast cancer was constructed. The topological network properties were calculated using Network Analyzer in Cytoscape [16]. The web-based tools Database for Annotation Visualization and Integrated Discovery (DAVID) [17] and FunRich [18] were used to study the biological functions of these genes including the gene ontology, functional annotation and pathway enrichment analysis [19,20].

miRNA target prediction
miRNAs are small non-coding RNAs considered as posttranscriptional regulators of several biological processes. Dysregulation of miRNAs leads to disruption of signaling pathways causing disease. The influence of miRNAs on gene targets is one beneficial approach to get a better understanding of disease etiology [21]. The miRNA targets of breast cancer related genes were predicted by miRDB target predictor (www.mirdb .org), an online database for miRNA target prediction and functional annotation. The miRNAs were selected based on the target score (≤ 99).

Integrated pathway modeling
The integrated and metabolic networks of breast cancer related source genes were analyzed and the correlation between test genes was observed. To recognize the underlying pathways involved in the progression of breast cancer, pathway analysis was performed for identifying biomarkers of the disease. The curation and mapping of candidate biomarkers were done using Kyoto Encyclopedia of Genes and Genomes (KEGG) [22], Reactome and Wiki pathways. PathVisio3tool was used to reconstruct the cellular and signaling pathways of potential biomarkers [23] and the potential mechanism of each marker in the pathway was studied based on evidence available in literature and databases.

Drug-gene network analysis
The target genes interrelated with the anti-breast cancer drugs were identified using CTD (http://ctdba se.org/) database, an open source database for the curation of chemical-gene, gene-disease and chemical-disease interactions from literature [24]. The chemical-gene interaction query was used to access drugs against each breast cancer related genes. Drugs that were directly linked with breast cancer related genes were sorted in this interaction network. The FDA approval status of these drugs was also verified using the DrugBank database [25].

Gene expression analysis and normalization
Twelve breast cancer datasets were downloaded from the GEO database with cell format. Each database was having size of ArrayBatch object 1164 × 1164 and 732 × 732 features with related Affyids (Table 1). Quantile normalization was performed for normalization and background correction. This was done to avoid systematic variation. The probe level data obtained after normalization show the quality of the individual array of each dataset in the MA plots ( Fig. 1). The severity of RNA-degradation and significance level was presented by the function plotAf-fyRNAdeg ( Fig. 2) and a single summary statistic for each array in the batch was produced by the function summary of AffyRNAdeg (Additional file 1: Table S1). Additional file 2: Table S2 provides the list of databases, tools, and software used in this study.

Identification and screening of differentially expressed genes
In each dataset the differential expression analysis provided 50 DEGs by pairwise comparison between biologically comparable groups. Out of these 50 DEGs, the top 24 genes were ranked and selected in each dataset. The selection was based on FDR (< 0.05), p-value (≤ 0.05) and |logFC| (> 1) parameters. These 24 DEGs were further shortlisted to eight common genes as potential biomarkers for breast cancer (Additional file 3: Table S3).

Data curation and cluster analysis
The gene mapping of 24 DEGs through PubMed, OMIM, MeSH, and PMC databases provided eight significant breast cancer associated genes: ID4, NCOA1, RHEB, PDZK1, PLAUR, AKC1R2, ANXA1 and SLIPI. The role of these genes in breast cancer was curated and counted ( Table 2). The genetic expression of breast cancer cell samples showed a clear difference between the control and treated replicates (Fig. 3).

Protein network analysis
The protein-protein interaction analysis revealed the interaction of breast cancer related genes with other potential genes contributing to a pathological phenotype.
The network showed a total of 207 nodes and 226 edges that were retrieved from STRING [14] and HAPPI [15] databases. The network was categorized in three neighborhoods: red and blue nodes indicate the breast cancer associated potential biomarkers while the remaining yellow nodes represent the non-breast cancer target proteins. The potential biomarkers were found to functionally interact with other biologically essential target proteins, some of which are TCF4, TP53, mTOR, NOTCH1, ESR1 and ESR2. The source protein ID4  showed interaction with TCF4, NOTCH1 and WNT while NCOA1 and PDZK1 interacted with ESR1 and ESR2 potential biomarker proteins. ANXA1 was also associated with the CCL5, CXCR 10 and CXCL8 family of cytokines. The network analyzer was used to analyze the topological properties of the network. It also helped in classifying and improving the network performance (Fig. 4). The disease gene mapping of target genes using CTD showed that more than 50 genes have a functional relation with the source/seeder genes in breast cancer (Fig. 5). In gene enrichment analysis, the targeted genes were selected based on fold change and a p-value cut-off (< 0.05). The analysis revealed significant enrichment of these genes with mTOR signaling pathway, TGF-β signaling pathway, P13-AKT signaling pathway, insulin signaling pathway, thyroid signaling pathway and complement coagulation cascade ( Table 4). The transcription factors identified were RBPJ, NHLH1, HENMT1, PHOX2A, CACD and ISL2. The transcription factors (TFs) NHLH1 and HENMT1 showed 50% abundance with known breast cancer genes (Fig. 5).

Pathway modeling
The gene signatures isolated were further studied to understand their role in the progression of breast cancer and their underlying molecular mechanism. The signature genes were analyzed for their interaction with other proteins in breast carcinogenesis through reconstruction of a network. The pathways involved in the progression of breast cancer were the MTOR signaling pathway, estrogen signaling pathway, P13-AKT signaling pathway, TGF-β signaling pathway and the insulin signaling pathway. The source genes interact with other target genes through these signaling pathways leading to the occurrence of breast cancer. The network shows the heterogeneous nature of breast cancer which is the major obstacle in defining therapies with desirable outcomes (Fig. 6).

Drug-gene network analysis
For drug-gene network analysis the toxicogenomic approach was used to further investigate the existing treatment options for breast cancer therapy. This was done to better understand the disease etiology. The publicly available database CTD identified 65 drugs that interacted with these signature genes. In total, 57 target drugs were FDA approved (Table 5). These drugs were found to interact with signature genes that are involved in the progression of breast cancer (Fig. 7).

Discussion
Due to its recurrence and heterogeneous nature, breast cancer is the leading cause of death in women globally. This calls for a better understanding of the molecular mechanisms of breast cancer in order to improve diagnosis and management.
This study focuses on the identification of several gene signatures, their functional annotation, potential protein-protein interactions and reconstruction of biological pathways for a better understanding of the disease. The differential expression analysis revealed eight gene signatures out of 50 DEGs based on physicochemical and functional studies that play a role in breast carcinogenesis. ID4, NOCA1, RHEB, ANXA1, AKR1C2, PDZK1, PLAUR and SLPI are the identified DEGs out of which five are upregulated and three are downregulated. The gene ontology of these genes showed functional enrichment in cellular communication, signal transduction, protein metabolism, transport and steroid hormone receptor signaling as well as essential roles in several important signaling pathways such as the MTOR, TGF-B, P13-Akt and insulin. These pathways have been studied for their role in the progression and occurrence of several cancers. ID4 belongs to a family of four helix-loop-helix (HLH) transcriptional regulators, termed as inhibitors of differentiation (ID) proteins. These proteins are involved in the regulation of several cell processes such as differentiation, transcription and cell cycle progression. Emerging evidence has shown a proto-oncogenic role of ID4 in basal like breast cancer (BLBC). An overexpression of  this gene is observed in this subtype of breast cancer and is correlated with the expression of TP53 protein which is involved in higher grade and metastasis risk. This has led it to be a poor prognostic marker of BLBC as the proliferation of BLBC cell lines require an overexpression of ID4 [26]. The gene network analysis also revealed the interaction of ID4 with several other proteins such as TCF, WNT, TP53 and NOTCH1. This supports the previous evidence of correlation of ID4 with TP53 in the proliferation of BLBC. The overexpression of nuclear receptor coactivator 1 (NCOA1) has also shown a positive correlation with disease metastasis and recurrence that resides in a subset of breast cancers. This gene belongs to the p160 SRC family and interacts with certain nuclear receptors and transcription factors (TFs) playing important roles in growth, development, reproduction and metabolism as well as in cancer. NCOA1 has been associated with HER2 expression, metastasis, disease recurrence and poor survival and overexpression in 19-29% of breast tumors [27]. Other interacting proteins identified through network analysis showing crosstalk with NCOA1 are the ESR and PPAR (Fig. 5). The pathway analysis also clarifies the role of NCOA1 in proliferation and metastasis of breast cancer by interaction with these proteins (Fig. 7). Another source protein identified through differential expression analysis is PDZ domain containing 1 (PDZK1) which is an adaptor protein expressed in the proximal tubules of kidney and has a pivotal role in lipid metabolism. However, this protein is thought to be responsive to estrogen in breast cancer cell lines (mcf-7). A significant correlation between 17B-estradiol plasma levels and PDZK1  mRNA expression has been shown in ER-α (+) breast tumors providing a link between Er-α and PDZK1 [28]. A potential candidate involved in the indirect link of this association is insulin-like growth factor-1 (IGS-1R). The gene ontology studies of these genes also revealed enrichment of these genes in the insulin signaling pathway, suggesting a link of this pathway in cell proliferation of breast tumors. Ras human enriched in brain (Rheb) is a small GTPbinding protein and a well-known regulator of mTOR. mTOR plays a pivotal role in cell proliferation, aging, protein synthesis and autophagy. Recent evidence has suggested a hyperactivity in Rheb-mTORC1 signaling axis in several human carcinomas [29]. Evidence also suggest an elevated expression of RHEB in epithelial cells of fibroadenomas providing an association of RHEB with insulin/AKT/TOR signaling pathway in benign tumor development [30]. The pathway analysis has also shown association of Rheb with these proteins suggesting its important role in cell cycle control and cell growth. Secreted proteins play a pivotal role in several types of cancer metastasis including breast tumors. One of the secreted proteins identified through differential analysis is SLPI which has a role in the progression and development of tumors. Several tumors have shown elevated gene expression levels of SLPI such as ovarian and lung cancer. A recent study has identified SLPI as a new target for anti-metastatic therapies due to its pro-metastatic part of secretome for breast cancer, chiefly for TNCs [31]. The two aldo-keto reductases AKR1C1 and AKR1C2 belong to the super family of AKR1C1 and are involved in progesterone metabolism. The metabolites of progesterone are basically involved in suppression of cell proliferation and adhesion. In tumorous breast tissues the expression of AKR1C1 and AKR1C2 is reduced promoting tumor growth and progression [32]. The association of over-activation of PLAUR (uPAR) with increased aggressive carcinoma is also well-studied. A correlation has been observed between HER2 and uPAR mRNA in disseminated tumors suggesting a cross talk between HER2 and uPAR signaling pathways causing recurrence or metastasis [33]. Moreover, Annexin A1 (AnxA1) is also a candidate regulator of oncogenic switch during which cancer cells change their phenotype from epithelial to migratory, mesenchymal-like. AnxA1 is an actin regulatory protein and its overexpression is associated with the BLBC subtype. It has a pro-angiogenic role in vascular endothelial cells, tumor growth and metastasis and is also involved in the regulation of TGFβ signaling. Evidence suggests AnxA1 as an additional marker in discriminating BLBC diagnosis from other subtypes [34]. The drug-gene network analysis revealed that several common drugs have shown interactions with these signature genes such as Tamoxifen, Cisplatin, Diazepam, Aspirin, Hydrocortisone, etc. opening the platform for repurposing of these drugs to better manage this disease.