Audits
What are audits?
The ENCODE Data Coordination Center has implemented a system of audits, or flags, to provide additional information to the research community about the quality of the data on the portal. These flags may indicate an error in the experimental metadata, or may indicate that the data itself does not meet some aspect of the Consortium's standards. The flag's color corresponds to the severity of the problem.
Read Coverage Audits
Audit message | Explanation |
---|---|
Control insufficient read depth, Control low read depth | A ChIP-seq control should be sequenced to similar depth as the experiment. For this experiment, the control depth is insufficient. |
Extremely low read depth, Insufficient read depth, Low read depth | The read depth of the alignment files is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links: WGBS, DNase-seq, Bulk RNA-seq, shRNA, CRISPR knockdown, Small RNA-seq, RAMPAGE, ChIP-seq, ATAC-seq, microRNA-seq, microRNA counts, RNA Bind-n-Seq, eCLIP |
Extremely low coverage, Insufficient coverage, Low coverage | Each biological replicate processed by the WGBS paired-end pipeline is recommended to have 30X coverage. Replicates with coverage between 25X and 30X receive a "Low coverage" audit; replicates with coverage between 5X and 25X receive an "Insufficient coverage" audit; and replicates with coverage less than 5X receive an "Extremely low coverage" audit. |
Insufficient number of aligned reads, Borderline number of aligned reads | The alignment file for a microRNA-seq experiment did not have a sufficient number of aligned reads. For more information on data standards, please visit the microRNA-seq page. |
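
The WGBS coverage bands described in the table can be written as a small classifier. This is a minimal sketch, not the DCC's implementation: the function name `wgbs_coverage_audit` is hypothetical, and the handling of values that fall exactly on 5X, 25X, or 30X is an assumption, because the text does not say which band owns each boundary.

```python
def wgbs_coverage_audit(coverage):
    """Return the coverage audit message for one WGBS biological replicate.

    `coverage` is the replicate's fold coverage (e.g. 27.5 for 27.5X).
    Returns None when the recommended 30X coverage is met.
    Boundary values are assigned to the higher band (an assumption).
    """
    if coverage < 5:
        return "Extremely low coverage"
    if coverage < 25:
        return "Insufficient coverage"
    if coverage < 30:
        return "Low coverage"
    return None


# Example: classify a few hypothetical replicates.
for cov in (3.2, 18.0, 27.5, 31.0):
    print(cov, wgbs_coverage_audit(cov))
```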
Replication Audits
Audit message | Explanation |
---|---|
Insufficient replicate concordance, Low replicate concordance, Borderline replicate concordance | The value for the metric used to measure replicate concordance is below the given threshold for the assay type, so the experiment has been flagged for poor reproducibility. For a list of standards for individual assay types, see the following links: WGBS |
Unreplicated experiment | ENCODE experiments (excluding ENTEx/GTEx) are required to have at least two biological replicates. Experiments using samples from the GTEx consortium do not require more than one replicate because of the limited availability of tissues. |
Inconsistent replicate | A file came from a replicate that belongs to a different experiment from the one in which the file is found, or the replicate numbers do not match between parent and derived files. |
Technical replicates with not identical biosample | Two technical replicates do not share the same biosample. |
Replicate with no library | The library created from the experimental replicate was not uploaded and/or attached to the replicate. |
Insufficient number of reproducible peaks, Moderate number of reproducible peaks | The ATAC-seq dataset lacks sufficient numbers of reproducible peaks in the overlap peaks or IDR thresholded peaks files. |
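
Two of the replication rules above are simple enough to sketch in code: the two-biological-replicate requirement and the rule that technical replicates must share a biosample. The sketch below is illustrative only. The dictionary keys, the `exempt_from_replication` flag (standing in for the ENTEx/GTEx exception), and the function name are all hypothetical, and technical replicates are assumed to be the replicates that share a biological replicate number.

```python
from collections import defaultdict


def replication_audits(replicates, exempt_from_replication=False):
    """Return replication audit messages for one experiment.

    `replicates` is a list of dicts with the hypothetical keys
    'biological_replicate_number' and 'biosample'.
    """
    # Group biosamples by biological replicate; replicates that share a
    # biological replicate number are treated as technical replicates.
    biosamples_by_bio_rep = defaultdict(set)
    for rep in replicates:
        biosamples_by_bio_rep[rep["biological_replicate_number"]].add(rep["biosample"])

    audits = []
    if not exempt_from_replication and len(biosamples_by_bio_rep) < 2:
        audits.append("Unreplicated experiment")
    if any(len(samples) > 1 for samples in biosamples_by_bio_rep.values()):
        audits.append("Technical replicates with not identical biosample")
    return audits


# Example: one biological replicate whose two technical replicates
# point at different (made-up) biosample identifiers.
example = [
    {"biological_replicate_number": 1, "biosample": "sample-A"},
    {"biological_replicate_number": 1, "biosample": "sample-B"},
]
print(replication_audits(example))
```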
Library Complexity Audits
Audit message | Explanation |
---|---|
Severe bottlenecking, Mild to moderate bottlenecking | Bottlenecking for ChIP-seq assays is measured using PCR Bottlenecking Coefficients 1 and 2 (PBC1 and PBC2). For more information on the expected values, see the ChIP-seq standards page. |
Insufficient library complexity, Moderate library complexity, Poor library complexity | Library complexity for ChIP-seq experiments is measured using the Non-Redundant Fraction. For more information on the expected values for NRF, see the ChIP-seq standards page. |
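
For readers unfamiliar with these metrics, the sketch below computes NRF, PBC1, and PBC2 from a list of alignment positions, using the definitions commonly given for them: NRF is the number of distinct positions divided by the total number of reads, PBC1 is the number of positions covered by exactly one read divided by the number of distinct positions, and PBC2 is the number of positions covered by exactly one read divided by the number covered by exactly two. The function name and the input representation (one tuple per uniquely mapped read) are assumptions for illustration, and no thresholds are applied because this page defers those to the ChIP-seq standards page.

```python
from collections import Counter


def library_complexity(read_positions):
    """Compute NRF, PBC1 and PBC2 from uniquely mapped read positions.

    `read_positions` is an iterable with one hashable entry per read,
    e.g. (chromosome, start, strand). Undefined ratios are returned as None.
    """
    counts = Counter(read_positions)
    total_reads = sum(counts.values())
    distinct_positions = len(counts)
    m1 = sum(1 for n in counts.values() if n == 1)  # positions with exactly one read
    m2 = sum(1 for n in counts.values() if n == 2)  # positions with exactly two reads
    return {
        "NRF": distinct_positions / total_reads if total_reads else None,
        "PBC1": m1 / distinct_positions if distinct_positions else None,
        "PBC2": m1 / m2 if m2 else None,
    }


# Example: six reads over four distinct positions (made-up coordinates).
example = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"),
    ("chr2", 40, "+"), ("chr2", 900, "-"), ("chr2", 900, "-"),
]
metrics = library_complexity(example)
print(metrics)
```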
Enrichment Audits
Audit message | Explanation |
---|---|
Extremely low SPOT score | The SPOT (Signal Portion of Tags) score is a measure of enrichment used in DNase-seq experiments. For more information on the expected values, please see the DNase-seq standards page. |
Extremely low TSS enrichment | Transcription Start Site (TSS) enrichment values for alignments in an ATAC-seq assay are below standards. The ideal value is > 7, with 5-7 being acceptable and < 5 being non-compliant. For more information on the expected values, please see the ATAC-seq standards page. |
FRiP (fraction of reads in called peak regions) for overlap peak files in an ATAC-seq assay is below standards. The ideal value is > 0.3, with 0.2-0.3 being acceptable and < 0.2 being non-compliant. For more information on the expected values, please see the ATAC-seq standards page. |
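
The ATAC-seq bands quoted above for TSS enrichment and FRiP share the same shape, so one helper can express both. This is a sketch under stated assumptions: the function names are hypothetical, and the returned labels ("ideal", "acceptable", "non-compliant") are the band names from the text rather than audit messages, since the page does not say which audit message corresponds to which band.

```python
def classify_band(value, acceptable_min, ideal_min):
    """Place a metric into the ideal / acceptable / non-compliant bands.

    Values strictly above `ideal_min` are ideal, values from
    `acceptable_min` up to and including `ideal_min` are acceptable,
    and anything lower is non-compliant.
    """
    if value > ideal_min:
        return "ideal"
    if value >= acceptable_min:
        return "acceptable"
    return "non-compliant"


def tss_enrichment_band(tss_enrichment):
    # Bands from the text: > 7 ideal, 5-7 acceptable, < 5 non-compliant.
    return classify_band(tss_enrichment, acceptable_min=5, ideal_min=7)


def frip_band(frip):
    # Bands from the text: > 0.3 ideal, 0.2-0.3 acceptable, < 0.2 non-compliant.
    return classify_band(frip, acceptable_min=0.2, ideal_min=0.3)


print(tss_enrichment_band(6.1), frip_band(0.15))
```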
Uniform Pipeline Requirements
Audit message | Explanation |
---|---|
Missing spikeins | Bulk RNA-seq, shRNA knockdown, and CRISPR editing followed by RNA-seq assays require spike-ins, but they are missing in the given experiment. |
Missing RNA fragment size | The library created for an RNA-seq experiment lacks information on the size of the library fragments. |
Missing input control | ChIP-seq experiments must have at least one input control, but the control given is not an input control. For example, it may be a mock immunoprecipitation instead. |
Missing run_type | Fastq file does not contain information on the run type used to produce it (single vs paired end). |
Inconsistent control read length, Inconsistent control run type, Inconsistent control platform | The uniform pipelines expect that the controls and the files they control share identical run_types and read lengths. Otherwise, trimming may be required. Similarly, the pipelines expect the same or comparable sequencing platforms. |
Inconsistent platforms | The uniform pipelines expect files within an experiment to have been produced from the same or comparable sequencing platforms. |
Mixed read lengths, Mixed run types | A single experiment contains fastq files with different read lengths and/or run types, either within or among replicates. |
Extremely low read length, Insufficient read length, Low read length | The sequencing read length is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links: WGBS, DNase-seq, Bulk RNA-seq, shRNA, CRISPR knockdown, Small RNA-seq, RAMPAGE, ChIP-seq, ATAC-seq, microRNA-seq, microRNA counts, RNA Bind-n-Seq, eCLIP |
Non-standard run_type | The pipelines for specific assay types require either single-end or paired-end sequencing, but the required sequencing type was not performed. For a list of standards for individual assay types, see the following links: |
Not compliant platform | The sequencing platform used is not compatible with the processing pipeline. For a list of standards for individual assay types, see the following links: WGBS DNase-seq Bulk RNA-seq, shRNA, CRISPR knockdown Small RNA-seq RAMPAGE ChIP-seq ATAC-seq microRNA-seq microRNA counts RNA Bind-n-Seq eCLIP |
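
The "Missing run_type", "Mixed read lengths", and "Mixed run types" conditions above amount to set comparisons over an experiment's fastq files. The following sketch illustrates that idea. The dictionary keys `read_length` and `run_type` and the function name are hypothetical stand-ins for whatever the file metadata actually provides, and read-length thresholds are left out because they vary by assay type.

```python
def fastq_consistency_audits(fastqs):
    """Return audit messages for the fastq files of one experiment.

    `fastqs` is a list of dicts with the hypothetical keys
    'read_length' (int) and 'run_type' ('single-ended' or 'paired-ended').
    """
    audits = []
    if any(f.get("run_type") is None for f in fastqs):
        audits.append("Missing run_type")

    # Collect the distinct values that are actually present.
    read_lengths = {f["read_length"] for f in fastqs if f.get("read_length") is not None}
    run_types = {f["run_type"] for f in fastqs if f.get("run_type") is not None}

    if len(read_lengths) > 1:
        audits.append("Mixed read lengths")
    if len(run_types) > 1:
        audits.append("Mixed run types")
    return audits


# Example: two fastqs from the same experiment with different read lengths.
example = [
    {"read_length": 36, "run_type": "single-ended"},
    {"read_length": 101, "run_type": "single-ended"},
]
print(fastq_consistency_audits(example))
```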
Antibody Audits
Audit message | Explanation |
---|---|
Duplicate lane review | The flagged antibody has already been reviewed for the biosample type in question. For example, there may be multiple lanes in a single western blot that are characterizing the antibody in K562. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays. |
Not tagged antibody, Inconsistent target | The antibody and experimental targets of interest do not match. This may be because the experimental target is tagged and the antibody does not apply to that tag, or because the target proteins are completely different between the antibody and the experiment. |
Mismatched tag target | The antibody target and experiment target are not the same. This may be a metadata error and should be clarified by the lab and the DCC if they are meant to be the same, or if the experiment is using a different antibody with the proper target. |
Not eligible antibody | The antibody used in the experiment is not eligible for use because it has not been fully characterized in the biosample type (e.g. liver tissue or K562) used by the experiment. |
Partially characterized antibody | The antibody used in the experiment has either its primary or secondary characterization (but not both) for the given biosample (e.g. liver tissue or K562) used by the experiment. |
Uncharacterized antibody | The antibody used in the experiment is lacking a primary and secondary characterization for the given biosample (e.g. liver tissue or K562) used by the experiment. |
Antibody not characterized to standard | The antibody used in the experiment has non-compliant characterizations, and no compliant characterizations, for the biosample type (e.g. liver tissue or K562) used by the experiment. |
Antibody characterized with exemption | The antibody used in the assay did not pass its primary characterization test, but the secondary characterization was able to rescue the primary and it passed with exemption. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays. |
Characterizations not reviewed | The antibody has old characterizations, perhaps from previous iterations of ENCODE, that were not reviewed or submitted for review. |
No characterizations submitted | The antibody lacks any attempt at characterization. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays. |
No primary characterizations | The antibody does not have any attempt at primary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays. |
No secondary characterizations | The antibody does not have any attempt at secondary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays. |
Need compliant primaries | Any and all attempts at primary characterization of this antibody do not meet the standards. |
Need compliant secondaries | Any and all attempts at secondary characterization of this antibody do not meet the standards. |
Metadata Audits
Audit message | Explanation |
---|---|
Missing antibody | If an assay type requires any of the following, but the required property was not provided, the experiment is given a flag: antibody, ontology name or ID for the assay type, biosample used to make the library, the ontology term or ID for the biosample, the type of biosample (e.g. immortalized cell line or tissue), the biosample’s donor, the target or molecule of interest, the transfection type (i.e. stable or transient), or protocol documents. |
Multiple paired_with | The raw files are the product of paired-end sequencing, and the file in question has been marked as paired with more than one other file, which is not allowed. |
Missing raw data in replicate | Each experimental replicate's library must have a corresponding raw sequencing file, such as a fastq. |
Missing derived_from | A processed file should have information on the files from which it was derived. For example, an alignment file should indicate which raw data files and reference indices were used to create it. |
Missing control alignments | A peaks file must specify the control alignments used to generate it. |
Missing possible controls | ChIP-seq, RAMPAGE, and CAGE experiments all require controls. A flag appears if the control is missing or has a different biosample type from the experiment it controls (e.g. K562 versus MCF-7). A flag will also appear if the control files are not matched to corresponding experimental files; this information is stored in a property called "controlled_by" in the experimental file object. Please note that the "missing possible controls" audit has different flag colors depending on the assay type and the project phase. |
Missing genotype, Missing external identifiers | Biosample donors for worms and flies (Caenorhabditis and Drosophila) must have their genotypes listed in accordance with the nomenclature rules (for fly, for worm), and must have external references (e.g. GEO/SAMN IDs) listed. |
Inconsistent ontology term | The ontology term does not match the ontology ID provided. |
Inconsistent depleted_in_term length, Depleted_in length mismatch | Some tissue type was removed from the biosample before library creation. The list of ontology term names of the tissues removed does not match the list of the ontology term IDs in either the biosample or the library. |
Inconsistent organism, Inconsistent donor, Inconsistent library biosample, Inconsistent age, Inconsistent sex | The biosamples used for each replicate within an experiment do not share the given properties. |
Inconsistent paired_with | Two read pair files from paired-end sequencing are annotated as belonging to different experiment replicates. |
Missing paired_with | A paired-end fastq in this dataset is missing metadata on its paired file. |
Inconsistent read count | A fastq in this dataset is paired with a fastq with a different read count. |
Inconsistent target of control experiment | A control experiment does not have its target annotated as “control” in the metadata. Rather, the target is some transcription factor or chromatin modifier. |
Inconsistent control | The experiment and its control were not performed on the same type of biosample, e.g. same cell line or tissue type; the control file is of a different format than the experimental file being controlled, e.g. fastq vs. idat; or the control file being used is from a control experiment that is not listed in the possible_controls property of the experiment. |
Inconsistent document_type | A document has been attached to a file, but does not describe the file format specifications for that file. |
Inconsistent mutated_gene organism | The organism from which the biosample came does not match the organism of the mutated gene in the donor. Donor mutated_gene should be of the same species as the donor and biosample. |
Invalid donor mutated_gene | Donor mutated genes should not be tags, controls, recombinant proteins, or modifications. |
Invalid dates | The date that the cell culture was harvested precedes the date on which the culture was started. |
Invalid possible_control | The experiment being used as a control is not designated as a control in the metadata. |
Invalid depleted_in_term_id | Before sequencing the library, a specific type of nucleic acid (e.g. polyA RNA) was removed. The nucleic acid that was sequenced is listed as the same nucleic acid that was removed. |
Unexpected step_run | The incorrect pipeline step was attached to the file in question, e.g. a peak calling step that outputs peak files was attached to an alignment file instead. |
Missing analysis step_run | A processed file in the dataset has no metadata about the pipeline run that generated it. |
Matching md5 sums | A processed file in this dataset is identical to another file. |
Lacking processed data | The dataset has no downstream processed data. |
Missing genetic modification characterization | A genetic modification used in this experiment is lacking a characterization. |
Missing biosample characterization | A genetically modified biosample used in this experiment is lacking a characterization. |
File validation error | A file in the dataset failed to pass automated validation checks during the submission process. |
Missing fragmentation method | Hi-C libraries must specify the fragmentation method used to generate them. |
Missing genetic modification reagents | Genetic modifications should specify any reagents used to perform the modification. |
Missing queried RNP size range | eCLIP libraries should specify the queried RNP size range. |
Inconsistent assembly | A file in this dataset is aligned to an assembly different from the assembly of the file it derives from. |
Improper control type of control experiment | The experiment lacks a control experiment with control_type of "input library" or "wild type". |
Missing control type of control experiment | The control experiment has no control_type specified. |
Unexpected target of control experiment | The experiment is a control experiment, but has a specified target. |
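
Several of the metadata audits above are direct comparisons between two fields, which makes them easy to illustrate. The sketch below covers "Invalid dates" and the paired-end fastq checks ("Missing paired_with", "Inconsistent paired_with", "Inconsistent read count"). All function names and dictionary keys are hypothetical, chosen only to mirror the wording of the table.

```python
from datetime import date


def culture_dates_audit(culture_start_date, culture_harvest_date):
    """Flag a culture whose harvest date precedes its start date."""
    if culture_harvest_date < culture_start_date:
        return "Invalid dates"
    return None


def pairing_audits(fastq, mate):
    """Compare a paired-end fastq with its mate.

    `fastq` and `mate` are dicts with the hypothetical keys
    'replicate' and 'read_count'; `mate` is None when no paired
    file is recorded.
    """
    if mate is None:
        return ["Missing paired_with"]
    audits = []
    if fastq["replicate"] != mate["replicate"]:
        audits.append("Inconsistent paired_with")
    if fastq["read_count"] != mate["read_count"]:
        audits.append("Inconsistent read count")
    return audits


# Example with made-up values: the harvest date is earlier than the start
# date, and the two mates report different read counts.
print(culture_dates_audit(date(2020, 1, 10), date(2020, 1, 3)))
read1 = {"replicate": "rep1", "read_count": 1000}
read2 = {"replicate": "rep1", "read_count": 998}
print(pairing_audits(read1, read2))
```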
Dataset Consistency Audits
Audit message | Explanation |
---|---|
Missing reference | Publication file sets should be linked to a specific publication. |
Missing IHEC required assay | Reference Epigenome datasets must have at least one of each of the IHEC required assays. |
Multiple donors in reference epigenome | A reference epigenome dataset has experiments conducted in biosamples from more than one donor. |
Multiple biosample treatments in reference epigenome | A reference epigenome dataset should not have multiple kinds of treatments between experiments, even if the type of biosample used is the same. |