We merged five breast cancer WGS datasets, downloaded from public repositories: (1) 208 cases from the PCAWG consortium1; (2) 72 cases from the International Cancer Genome Consortium (ICGC) French cohort19; (3) 395 cases from the Sanger cohort15 (among the original 560, 108 and 47 were included in the PCAWG and ICGC French cohort studies, respectively; we were able to download 395 of the remaining 405); (4) 87 cases from the British Columbia cohort20 (among the original 93, 5 cases that were sequenced from formalin-fixed paraffin-embedded tissues were excluded, 1 could not be downloaded); and (5) 20 cases from the Yale study21. Two cases were excluded owing to poor data quality. This established our 780-patient cohort for detailed analysis (Supplementary Table 1). The institutional review board of the Harvard Faculty of Medicine approved this study (IRB18-0151). Individual studies complied required ethical guidelines per published manuscripts.
Uniform data processing and identification of variants
To remove any potential artefacts that may arise from different data processing and analysis steps for different cohorts, we re-processed all data and applied a uniform set of variant calling methods. We used Bazam (v1.0.1)51 to extract FASTQ files from the BAM or CRAM files and realigned the reads to hs37d5 (as done in PCAWG) using BWA-MEM (v0.7.15)52. We used Samtools (v1.3.1)53 to merge the realigned bam fragments and Picard (v2.8.0) to add read groups and to mark PCR duplicates.
We applied the Hartwig Medical Foundation (HMF) bioinformatics pipeline22 for our analysis (https://github.com/hartwigmedical/hmftools), as it provides a streamlined software suite for analysing multiple variant types including SNVs, indels, SVs and allele-specific CNVs. We chose this pipeline because, in their PURPLE algorithm (v2.54), the boundaries of copy-number segments were determined by jointly analysing regional depth of coverage (COBALT v1.11), B-allele frequency (AMBER v3.5), and, most importantly, SVs. This integration resulted in near-complete concordance between the rearrangement breakpoints and the copy-number boundaries, which was pivotal in analysing the SVs at the amplification boundaries. SVs were called primarily by GRIDSS2 (ref. 54) (v2.12.0), annotated with RepeatMasker (v4.1.2-p1) and Kraken2 (ref. 55) (v2.1.2), filtered by GRIPSS (v1.9), and further annotated and analysed with LINX56 (v1.15). SNVs and indels were primarily called by SAGE (v2.8) with recommended parameters for 30x tumour coverage. Tumours showing genomic features of HR deficiency were identified using CHORD57 (v2.00).
Based on our benchmark analysis (Supplementary Note), we applied an in-house filter for short, non-reciprocal, and singleton inversions for samples that showed a large number of such probably artefactual patterns. We also filtered out somatic L1 transduction events that originated from the 18 hot source retroelements detected from 279 breast cancers (PCAWG + Ferrari et al. study19) using xTea58 (v0.1.6; Supplementary Note and Supplementary Table 6).
Defining focal amplifications
We defined an amplicon as a genomic segment for which the absolute copy number was more than three times greater than the baseline copy number of the chromosome arm. The arm-level baseline was defined as the integer copy number supported by the largest of the combined genomic segments sharing the same copy number. For the chromosome arms where the most common copy number was haploid, we considered diploid as the baseline copy number. Contiguous amplicons were merged, and the amplicons less than 1 kb in size were removed. From the 780 breast cancer cases, we identified 11,490 amplicons. Among these, we focused on 5,502 (48%) amplicons that bordered an unamplified region (Supplementary Table 2). The unamplified regions were defined as those where the copy number was no greater than one above the arm-level baseline. We included regions where the copy number was one copy above the arm-level baseline, because the unequal breakpoints in the two flipped dicentric chromosomes during the TB breakage could result in duplicated genomic segments.
For pan-cancer analysis, we also applied the same steps to the PCAWG consensus calls to define amplicons. Due to the incomplete concordance between copy-number boundaries and SV breakpoints, we created additional copy-number segments on the consensus copy-number calls: we divided a copy-number segment into two when there was an SV breakpoint in the middle, at least 10 bp apart from both boundaries; next, the absolute copy number was re-calculated for each genomic segment from BAM files considering depth, GC content and mappability. Based on this analysis, we focused on 6,586 amplicons adjacent to the unamplified regions.
The SVs at their boundaries were classified into six categories (fold-back inversion, translocation, simple head-to-tail, tandem duplication, intra-chromosomal complex, and no SV support) as described in Fig. 5b. Fold-back inversion were defined as head-to-head or tail-to-tail intra-chromosomal SV with breakpoints less than 5 kb apart. Among the amplicons generated by simple head-to-tail SVs (duplication-like), double minutes (DMs) were defined as amplicons with copy number more than three times greater than that of adjacent segment; while those with a copy number three times or less were classified as tandem duplications. Other amplicons bordered by intra-chromosomal SVs were classified as intra-chromosomal complex rearrangements, which often resulted from chromothripsis. Amplicons without supporting boundary SVs were grouped as ‘no SV support’. Hierarchical clustering of tumour types was performed using their fraction of fold-back inversion, translocation, and DMs at the amplicon boundaries. The optimal number of clusters was determined to be k = 4 due to the distinct differences observed among the groups.
Correlative analysis with TB amplification
Among the 244 breast cancer cases with TB amplification, many displayed the genomic footprints of TB amplification and chromothripsis at the same time. Some cases showed a heavier burden of intra-chromosomal rearrangements than of boundary translocations, suggesting a predominant role of chromothripsis in these cases. To conduct a correlative analysis between TB amplification and driver genomic events, we selected 151 (out of 244) cases exhibiting an extensive footprint of TB amplification with 10 or more translocations between the involved chromosomes. We used the potential driver genetic alterations identified by the PURPLE algorithm, which included recurrently altered genes by mutation (n = 363), germline alterations (n = 15) and deletions (n = 124). We excluded gene amplifications (n = 127) here because many of them were TB amplification. We selected the top 10% of genes in each class and examined their presence in the tumours with and without extensive footprint of TB amplification. We excluded the genetic alterations present in less than 5% of the samples (39 cases). Primary statistical testing was performed by the two-sided Fisher’s exact test with FDR <0.1.
Reconstruction of complex genomic rearrangements
Complex genomic rearrangements were reconstructed as described29. Given the higher structural complexity of the amplicons compared to that of fusion oncogenes, we focused on the SVs at the borders of LOH segments as well as on the amplified SVs, which are more likely to have occurred earlier than the SVs on the already-amplified segments. The amplified SVs were defined based on the abundance of supporting read fragments with respect to a tumour-specific threshold. To determine the threshold, we sorted all SVs in each tumour by the number of supporting read fragments and chose an inflection point, beyond which the increase in the number of supporting reads changed markedly.
To reconstruct the rearrangements, we first connected the chromosomal regions through the amplified SVs. The unamplified SVs within the amplicons were excluded due to their late timing (probably after the ecDNA formation). Then, the SVs outside of the amplicons were connected to finalize the most likely ancestral karyotype. For visualization (for example, Fig. 1c), we plotted the allelic absolute copy number as well as the SVs (vertical lines and connecting arcs) with their number of supporting read fragments (available in the PURPLE output), which provides the relative timing information within the amplified regions. On these plots, we often displayed chromosomes in their flipped orientations (p arm on the right and q arm on the left) for easier mechanistic interpretation. To avoid overlap between the major- and the minor-allele copy-number segments, we subtracted 0.2 from the minor-allele copy numbers. We shaded amplicon regions with orange colour and annotated key oncogenes on top.
Clustering structural variations
We used LINX56 (v1.15) to identify genomic rearrangement clusters. In brief, LINX uses multiple additional criteria to group SVs into clusters other than breakpoint proximity, including: SVs that are phased by a deletion bridge or an LOH segment, translocations connecting two chromosome arms in common, all fold-back inversions on a chromosome arm, and others. In 780 breast cancers, we identified 1,556 complex genomic rearrangement clusters with 10 or more SVs involved. Among these, 295 clusters (found in 245 samples) involved multiple chromosomes and contained boundary translocations, which are the key features of TB amplification (Extended Data Fig. 4b). On average, these clusters contained 137 SVs (range: 10−1,515) and 3.75 boundary translocations (range: 1−33). Fusion genes were analysed as part of this step, and the result was discussed in Supplementary Note (Supplementary Fig. 6).
Analysis of mutational signatures
We calculated the mutational spectra of SNVs and indels using SigProfilerMatrixGenerator59 (v0.1.0) with the SBS-96 and ID-83 classification system. We performed de novo extraction of the mutational signatures, matched them to the reference catalogue, and refitted and validated them using MuSiCal60 (v1.0.0-beta). We used an expanded SBS and ID signatures catalogue described in the MuSiCal manuscript. One ID signature did not match to the ID catalogue but had high similarity to a recently described ID signature commonly observed in people with African ancestry61. We also separately analysed the mutational signatures near the SV breakpoints (Supplementary Fig. 4). In this analysis, the observed SBS mutational spectra were linearly decomposed using SBS1, SBS2, SBS5, SBS13 and SBS18.
Mutational timing analysis
We analysed the timing of copy-number gains for genomic segments larger than 5 Mb. Two approaches were used. First, we used relative timing, the temporal order among the different copy number-gained segments. This could be estimated by the ratio of amplified versus unamplified mutations in each amplified segment, using MutationTimeR algorithm38 (v1.00.2). Synchronous copy-number gain events were determined using the code accompanying a previous publication38 (available at https://gerstung-lab.github.io/PCAWG-11/). Second, we calculated the absolute mutation burden of the ancestral cell at the moment of copy-number gains. We quantified the number of mutations amplified up to the maximal major copy number of the non-bridge arms from the cases with TB amplification. A mutation was assumed to be a pre-amplification event when the estimated copy number of the mutation was larger than the major copy number of the locus times 0.75. If the estimated copy number was smaller, the mutation was classified as post-amplification or on the minor allele. The probabilities of being pre-amplification, post-amplification or on the minor allele, and subclonal mutation were estimated using the binomial distribution as previously described29. We used this approach in estimating the absolute timing of common aneuploidies, including gain of 1q, 8q, 16p (in those cases with a paired 16q loss) and 20q and whole-genome duplication.
Assuming a stable mutation rate in early oncogenesis, we estimated the timing of non-bridge arms and common arm-level copy-number gains. The clonal mutation burden increases with age at a rate of 29.4 mutations per year in our selected cases (n = 147; purity ≥0.6, number of SNVs <10,000, fraction of SBS2 + SBS13 <0.5, no whole-genome duplication, microsatellite stable, and HR proficient; Extended Data Fig. 9b). We divided the median ancestral mutation burden by this rate to estimate the typical age when the common aneuploidy events occurred. When we repeated this analysis using total SNV counts (which may overestimate the rate of accumulation due to the inclusion of all subclonal mutations), we found a rate of 33.1 mutations per year (Extended Data Fig. 9b).
We used the RNA-sequencing data from ref. 15 to quantify the activity of the ER-driven transcriptome. For the 263 samples for which the data were available, we studied the expression of the frequently amplified genes with respect to their amplification status. The findings were validated in the METABRIC cohort14, using their diploid samples (n = 1,904) to minimize the impact of whole-genome duplication. We defined the ER target genes based on the Hallmark gene sets in MSigDB37. The list of 275 genes in the early and late oestrogen-responsive set included well-known ER target genes such as GREB1, TFF1 and PGR. We tested if these genes were differentially expressed between the ER+ and the ER− groups (n = 188 and 69, respectively, with 6 ER-unknown cases). Using the 136 genes showing a significantly higher level of RNA expression in the ER+ group, we determined the ER activity of each tumour, calculating the fraction of genes that had an expression level of 50th percentile or higher. We tested different percentile cutoff values, and the score based on the 50th percentile showed the best separation between the ER+ and ER− cases and a good spread within the ER+ cases (Fig. 4e).
Integration of the CRISPR screen information
To study the functional importance of the amplified genes, we integrated CRISPR screen data from the DepMap project23. Of the 46 breast cancer cell lines studied, ER and HER2 status were available for 41 cell lines. We used the gene effect score as the readout for cellular dependence on a given gene. (0 indicated no viability effect on cells by knockout of the gene; −1 indicated the median cytotoxic effect observed by knockout of common essential genes23). We compared gene effect score among the putative target genes in the amplicons (Supplementary Note).
Integration of the epigenomic data
We used epigenomic profiles from the ENCODE33 and Roadmap Epigenomics62 consortia (accession numbers and further details are provided in Supplementary Table 3). When MACS (v2)63-processed output was not available, we downloaded FASTQ files from GEO and aligned the reads to hg19 using Bowtie (v1.2.2.)64 with the unique mapping option. For generating input-normalized ChIP enrichment tracks and detecting significant peaks, we used MACS2 callpeak and bdgcmp functions with the q-value threshold of 0.01.
To quantify the relative enrichment of an epigenetic feature with respect to amplicon boundaries, we first identified all 100-kb bins that overlap the epigenetic feature, and then compared the number of bins that overlap versus not overlap amplicon boundaries. The significance was calculated using the one-sided Fisher’s exact test unless otherwise specified.
To find associations between the distribution of amplicon boundaries and epigenomic variables, we used the multivariate LASSO regression model, which is more tolerant to the multicollinearity between the variables compared to other linear regression models. A multivariate linear mixed-effect model also supported the conclusion. We evaluated multicollinearity among the epigenetic features by calculating the variance inflation factor (VIF), a standard method for quantifying collinearity between the dependent variables. We considered VIF >5 as concerning and >10 as serious collinearity issues65.
Analysis of three-dimensional chromatin contact
We explored the relationship between the chromatin contact frequencies and the chromosomal regions frequently involved in TB amplification (Supplementary Fig. 8). For the comparison of chromatin interactions between the E2-treated and untreated conditions, we used chromatin conformation capture-based high-throughput sequencing data in untreated- and E2-treated MCF7 cells66. The contact frequencies were combined for each chromosome arm-pair and then compared between the E2-treated and untreated conditions. We also analysed Hi-C data from T47D luminal breast cancer cell line from 4D Nucleome Data Portal (https://data.4dnucleome.org). Contact frequencies were normalized by balance-based method (KR normalization) using Juicer67 to reduce the effects from possible copy-number variations.
Cell lines and cultures
MCF7 (ATCC, HTB-22) and T47D (ATCC, HTB-133) cells were maintained in RPMI 1640 medium (Corning, 15-040-CV) supplemented with 10% fetal bovine serum (FBS; Gibco, 10437-028), 100 U ml−1 penicillin-streptomycin (Corning, 30-002-CI), and 2 mM l-glutamine (Corning, 25-005-CI). For RT–qPCR, cells were cultured in RPMI 1640 medium without phenol red (Corning, 17-105-CV) supplemented with 10% charcoal- and dextran-treated FBS (R&D System, S11650H) and either (1) 0.01% ethanol or (2) 1 μM β-estradiol (Sigma Aldrich, E2758) for 4 days with fresh β-estradiol every 24 h. For the HTGTS experiment, cells were plated in RPMI 1640 medium (Corning, 15-040-CV) supplemented with 10% FBS (Gibco, 10437-028), 100 U ml−1 penicillin-streptomycin, and 2 mM l-glutamine and were transduced with CRISPR–Cas9-containing lentiviral supernatants targeting SHANK intron 10 or RARA intron 1 with 6 μg/ml polybrene. Thirty hours after the lentiviral infection, cells were washed with phosphate-buffered saline (PBS) three times and cultured as described above for RT–qPCR. Cells were collected for the HTGTS library preparation on day 6. 293FT (Invitrogen/ThermoFisher Scientific, R70007) cells were used to produce CRISPR/Cas9-containing lentiviral particles and were maintained in DMEM medium (Corning, 15-017-CV) supplemented with 10% FBS, 100 U ml−1 penicillin-streptomycin, and 2 mM l-glutamine. All cell lines were tested negative for mycoplasma contamination and were cultured at 37 °C in 5% CO2 atmosphere.
Genomic DNA (gDNA) was extracted from MCF7 and T47D cells using rapid lysis buffer (100 mM Tris-HCl pH 8.0, 200 mM NaCl, 5 mM EDTA, 0.2% SDS) containing 10 μg ml−1 proteinase K (P2308, Sigma Aldrich). After overnight incubation at 56 °C, gDNA was precipitated in one volume isopropanol, and the DNA pellet was resuspended in Tris-EDTA buffer. gDNA was used for preparation of the HTGTS library.
Total RNA was isolated from the cells using Rneasy Plus Mini Kit (Qiagen, 74136). cDNA was synthesized using iScript cDNA synthesis kit (Bio-Rad, 1708891). All RT–qPCR experiments were performed in triplicate on Icycler iQ Real-Time PCR Detection System (Bio-Rad) with iTaq universal SYBR green supermix (Bio-Rad, 1725121). Expression levels for individual transcripts were normalized against ACTB. Primers for RT–qPCR are listed in Supplementary Table 7.
Lentiviral particle productions
To produce lentiviral particles, 5.5 × 106 293FT cells were plated in a 10 cm dish a day before the transfection. On the following day, cells were transfected using Xfect transfection reagent (Takara Bio, 631318) with 20 μg of lentiCRISPR–Cas9 plasmid, 3.6 μg of pMD2.G plasmid (Addgene, 12259), 3.6 μg of pRSV-Rev plasmid (Addgene, 12253) and 3.6 μg of pMDLg/pRRE plasmid (Addgene, 12251). The medium was changed with complete culture medium 6 h after transfection. The viral supernatant was collected 48 h post-transfection, passed through a 0.45-μm syringe filter (PVDF membrane; VWR, 89414-902), pooled, and used either fresh or snap frozen.
CRISPR–Cas9 sgRNA design and cloning
For SpCas9 expression and generation of single guide RNA (sgRNA), the 20-nt target sequences were selected to precede a 5′-NGG protospacer-adjacent motif (PAM) sequence. The human SHANK2 intron 10-targeting sgRNA and human RARA intron 1-targeting sgRNA were designed with the CRISPR design tool CRISPick (https://portals.broadinstitute.org/gppx/crispick/public). Oligonucleotides synthesized by Integrated DNA technology were annealed and cloned into the BsmbI–BsmbI sites downstream from the human U6 promoter in lentiCRISPR v2 plasmid (Addgene, 52961). sgRNA sequences were confirmed by Sanger sequencing with U6 promoter primer 5′-GAGGGCCTATTTCCCATGAT-3′. Oligonucleotides for sgRNA cloning are listed in Supplementary Table 7.
High-throughput genome-wide translocation sequencing
HTGTS libraries were generated by the emulsion-mediated PCR (EM-PCR) methods as previously described36. In brief, gDNA was digested with HaeIII enzyme (New England Biolabs, R0108) overnight. HaeIII-digested blunt ends were A-tailed with Klenow fragment (3′→5′ exo-; New England Biolabs, M0212). An asymmetric adaptor (composed of an upper liner and a lower 3′-modified linker; Supplementary Table 7) was then ligated to fragmented DNA. To remove the unrearranged endogenous SHANK2 and RARA locus, ligation reactions were digested with XbaI (New England Biolabs, R0145L) for SHANK2 locus and EcoRI (New England Biolabs, R0101L) for RARA locus, respectively. In the first round of PCR, DNA was amplified using an adaptor-specific forward primer and a biotinylated reverse SHANK2 primer oriented to capture the 5′ portion of SHANK2 junction and using a biotinylated forward RARA primer and an adaptor-specific reverse primer with Phusion High-Fidelity DNA polymerase (ThermoFisher Scientific, F530S). Twenty cycles of PCR were performed in the following conditions: 98 °C for 10 s, 58 °C for 30 s, and 72 °C for 30 s. Biotinylated PCR products were enriched using the Dynabeads MyOne streptavidin C1 (ThermoFisher Scientific, 65002), followed by an additional digestion with blocking enzymes for 2 h. Biotinylated PCR products were eluted from the beads by 30-min incubation with 95% formamide/10mM EDTA at 65 °C, and purified using Gel Extraction Kit (Qiagen, 2870). In the second round of PCR, the purified products were amplified with EM-PCR in an oil-surfactant mixture. The emulation mixture was divided into individual aliquots and PCR was performed using the following conditions: 20 cycles of 94 °C for 30 s, 60 °C for 30 s, and 72 °C for 1 min. The PCR products were pooled and centrifuged for 5 min at 14,000 rpm to separate the PCR product-containing phase and the oil layer. The layer was removed and the PCR products were extracted with diethyl ether three times. EM-PCR amplicons were purified using the Gel Extraction Kit. The third round of PCR (10 cycles) was performed with the same primers as in the second round of PCR, but with the addition of linkers and barcodes for Illumina Mi-seq sequencing. The third round PCR products were size-fractionated for DNA fragments between 300 and 1,000 base pairs on a 1% agarose gel (Bio-Rad, 1613102). The PCR products containing Illumina barcodes were extracted with the Gel Extraction Kit.
The HTGTS libraries were sequenced on Mi-seq (Illumina NS500 PE250) at the Molecular Biology Core Facility of the Dana-Farber Cancer Institute. The libraries were generated from each of the three biological replicate experiments and analysed for each experimental condition. Oligonucleotide primers used for SHANK2 and RARA library preparations are listed in Supplementary Table 7.
The HTGTS data were processed and aligned as previously described68. In brief, the reads for each experimental condition were demultiplexed by designed barcodes. To enhance the specificity and ensure that the analysed sequences contain the bait portion, reads were further filtered by the presence of primer sequence and additional five downstream bases. After the filtering, barcode, primer, and bait portions of the reads were masked for alignment. Then, the processed reads were aligned to GRCh37/hg19 using BLAT. We removed PCR duplicates (reads with same junction position in alignment to the reference genome and a start position in the read less than 3 bp apart), invalid alignments (including alignment scores < 30, reads with multiple alignments having a score difference <4 and alignments having 10-nucleotide gaps), and ligation artefacts (for example, random HaeIII restriction sites ligated to bait break site). The position of HTGTS breakpoints (often referred to as ‘junctions’ in previous publications36,68) were determined based on the genomic position of the 5′ end of the aligned read.
Due to the universal increase of the HTGTS breakpoints by the E2 treatment in all biological replicates and in both cell lines (Extended Data Fig. 7c), we primarily analysed breakpoint ratios in genomic bins between the E2-treated group and the control group (for absolute counts, the normalization-based approach used in previous HTGTS experiments68 negated the effect of the E2 treatment, as discussed in ref. 69). Thus, to investigate the mechanisms underlying the E2-induced translocations, we modelled the ratio of the HTGTS breakpoints (E2-treated/control) in genome-wide bins (250 kb) using multivariate LASSO regression. We used the same epigenomic datasets that were used in the modelling of the amplicon boundaries. In addition, we performed GSEA to study the gene sets enriched in the E2-induced HTGTS breakpoint hotspots. For this analysis, we calculated per-gene HTGTS breakpoint ratios (included ±5 kb upstream and downstream of the gene) and averaged the ratios from four experimental pairs (MCF7/SHANK2, T47D/SHANK2, MCF7/RARA and T47D/RARA) after excluding the bait region (±1 Mb from the CRISPR target site). Based on the ordered list of all genes, a pre-ranked GSEA was performed using the GSEA application (v4.2.3).
Statistics and reproducibility
The statistical tests or methods are described in the figure legends. We used R (v4.1.1) for all data processing and secondary computational analysis. For the HTGTS, we performed three biologically independent experiments per group (defined by cell line and CRISPR targets) as specified in the figure legends.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.