Error running index_db() on a large fastq file: Duplicate key? UNIQUE constraint failed: offset_data.key
#2,038 opened on April 30, 2019
Description
Setup
I am reporting a problem with Biopython version, Python version, and operating system as follows.
The script is running in a singularity container, which is set up as follows:
>>> import sys; print(sys.version)
3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-Ubuntu-19.04-disco
>>> import Bio; print(Bio.__version__)
1.73
The script is being called from a snakemake workflow, on a host configured as follows (biopython is not installed on the host):
>>> import sys; print(sys.version)
3.6.3 (default, Jan 9 2018, 10:19:07)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-redhat-7.6-Maipo
Expected behaviour
I'm trying to index a large fastq file with 613,974,956 records using SeqIO.index_db().
Actual behaviour
I'm getting an error about duplicate read IDs from index_db(), but I'm sure there are no duplicate identifiers. I've tried replacing the standard fastq identifier (e.g. @MG00HS20:1017:CAK56ANXX:6:1101:12529:2055 1:N:0:) with a single integer, but the error still occurs.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 732, in _build_index
    con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/archive/deardenlab/tomharrop/projects/racon-chunks/.snakemake/scripts/tmpliab3waj.index_reads.py", line 14, in <module>
    'fastq')
  File "/usr/local/lib/python3.7/dist-packages/Bio/SeqIO/__init__.py", line 1032, in index_db
    key_function, repr)
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 563, in __init__
    self._build_index()
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 738, in _build_index
    raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
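For reference, here is a minimal sketch of how the claim above (no duplicate identifiers) could be checked independently of Biopython. It assumes a strict four-line FASTQ layout with no wrapped sequence or quality lines. It also assumes the key is the first whitespace-delimited word of the title line, which is my understanding of the default record id for FASTQ. The function names `fastq_ids` and `duplicate_ids` are made up for illustration.

```python
from collections import Counter


def fastq_ids(lines):
    """Yield the identifier of each record in a strict four-line FASTQ stream.

    Assumption: every record occupies exactly four lines, so title lines are
    lines 1, 5, 9, ... and quality lines starting with '@' are never mistaken
    for titles.
    """
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:
            continue
        if not line.startswith("@"):
            raise ValueError(f"line {line_number + 1} does not start with '@'")
        words = line[1:].split(None, 1)
        if not words:
            raise ValueError(f"line {line_number + 1} has an empty title")
        # Assumed key: the first word of the title line.
        yield words[0]


def duplicate_ids(lines):
    """Return a dict mapping each repeated identifier to its occurrence count."""
    counts = Counter(fastq_ids(lines))
    return {key: n for key, n in counts.items() if n > 1}
```

This could be run as `duplicate_ids(open('path/to/reads.fq'))`. The counter holds every identifier in memory, so for hundreds of millions of records it may need a lot of RAM.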
Steps to reproduce
#!/usr/bin/env python3

from Bio import SeqIO

read_file = 'path/to/reads.fq'
db_file = 'path/to/reads.idx'

read_index = SeqIO.index_db(db_file,
                            read_file,
                            'fastq')
I don't know if it's relevant, but I've tried indexing the first 100,000,000 records and it worked. I don't see how to share a reproducible data set for this.
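For anyone who wants to narrow down where indexing starts to fail, subsets like the one mentioned above can be produced with something along these lines. This is only a sketch, assuming strict four-line FASTQ records. The helper name `first_records` is hypothetical.

```python
from itertools import islice


def first_records(lines, n_records):
    """Return an iterator over the lines of the first n_records records.

    Assumption: each FASTQ record occupies exactly four lines.
    """
    return islice(lines, 4 * n_records)
```

For example, `out_handle.writelines(first_records(in_handle, 100_000_000))` would write the first 100,000,000 records to a new file, which could then be passed to SeqIO.index_db() with a fresh index file.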
Thanks for reading!