biopython/biopython

Error running index_db() on a large fastq file: Duplicate key? UNIQUE constraint failed: offset_data.key

Open

#2,038 created on April 30, 2019

Labels: Bug, help wanted

Description

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows.

The script is running in a singularity container, which is set up as follows:

>>> import sys; print(sys.version)
3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-Ubuntu-19.04-disco
>>> import Bio; print(Bio.__version__)
1.73

The script is being called from a snakemake workflow, on a host configured as follows (biopython is not installed on the host):

>>> import sys; print(sys.version)
3.6.3 (default, Jan  9 2018, 10:19:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-redhat-7.6-Maipo

Expected behaviour

I'm trying to index a large fastq file with 613,974,956 records using SeqIO.index_db().

Actual behaviour

Calling index_db() raises an error about duplicate read IDs, but I'm sure there are no duplicate identifiers. I've tried replacing the standard fastq identifier (e.g. @MG00HS20:1017:CAK56ANXX:6:1101:12529:2055 1:N:0:) with a single integer, but the error still occurs.
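For anyone wanting to rule out duplicates independently of Biopython, a minimal sketch of such a check follows. It assumes strict four-line FASTQ records (no wrapped sequence or quality lines) and assumes the index key is the first whitespace-delimited word of the header line. The function name `find_duplicate_ids` is hypothetical and not part of Biopython.

```python
from collections import Counter


def find_duplicate_ids(lines):
    """Return {read_id: count} for every read ID seen more than once.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: each record is exactly four lines, and the key is the
    first whitespace-delimited word after the leading '@'.
    """
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:
            continue  # only header lines carry the read ID
        if not line.startswith("@"):
            raise ValueError(
                "Line %d does not look like a FASTQ header" % (line_number + 1)
            )
        fields = line[1:].split()
        counts[fields[0] if fields else ""] += 1
    return {read_id: n for read_id, n in counts.items() if n > 1}


# Example usage (path taken from the reproduction script):
# with open('path/to/reads.fq') as handle:
#     print(find_duplicate_ids(handle))
```

Note that this keeps every ID in memory, so for a file with hundreds of millions of records it is probably only practical on a subset or on a machine with plenty of RAM.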

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 732, in _build_index
    con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/archive/deardenlab/tomharrop/projects/racon-chunks/.snakemake/scripts/tmpliab3waj.index_reads.py", line 14, in <module>
    'fastq')
  File "/usr/local/lib/python3.7/dist-packages/Bio/SeqIO/__init__.py", line 1032, in index_db
    key_function, repr)
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 563, in __init__
    self._build_index()
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 738, in _build_index
    raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
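
The traceback shows the failure coming from a `CREATE UNIQUE INDEX` statement rather than from an individual insert, which would explain why the error only appears after the whole file has been scanned. The sketch below reproduces that kind of failure with the standard-library `sqlite3` module. The table layout is a simplified, hypothetical stand-in that borrows only the names `offset_data` and `key` from the traceback; it is not Biopython's actual schema.

```python
import sqlite3


def unique_index_error(keys):
    """Insert keys into an in-memory table, then try to add a unique index.

    Returns the sqlite3 error message if the index cannot be created,
    or None if it was created successfully.
    """
    con = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema; only the table and key column
    # names are taken from the traceback.
    con.execute("CREATE TABLE offset_data (key TEXT, byte_offset INTEGER)")
    con.executemany(
        "INSERT INTO offset_data (key, byte_offset) VALUES (?, ?)",
        [(key, i) for i, key in enumerate(keys)],
    )
    try:
        # Duplicates are only detected here, after all rows are inserted.
        con.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data (key)"
        )
    except sqlite3.IntegrityError as err:
        return str(err)
    finally:
        con.close()
    return None


print(unique_index_error(["read1", "read2"]))
print(unique_index_error(["read1", "read1"]))
```

The first call should print `None`, and the second should print a constraint-failure message similar to the one in the traceback.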

Steps to reproduce

#!/usr/bin/env python3

from Bio import SeqIO

read_file = 'path/to/reads.fq'
db_file = 'path/to/reads.idx'

read_index = SeqIO.index_db(db_file,
                            read_file,
                            'fastq')

I don't know if it's relevant, but I've tried indexing the first 100,000,000 records and it worked. I don't see how to share a reproducible data set for this.
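
Since a 100,000,000-record subset indexes cleanly, one possible way to narrow the problem down is to write progressively larger prefixes of the file and index each one. The helper below is a rough sketch of that idea, assuming strict four-line FASTQ records; `copy_first_records` is a hypothetical name, not a Biopython function.

```python
from itertools import islice


def copy_first_records(src_lines, dst, n_records):
    """Copy the first n_records FASTQ records from src_lines to dst.

    Assumption: every record is exactly four lines. Returns the number
    of lines actually written (fewer if the input is shorter).
    """
    written = 0
    for line in islice(src_lines, 4 * n_records):
        dst.write(line)
        written += 1
    return written


# Example usage (the output path is a placeholder):
# with open('path/to/reads.fq') as src, open('path/to/subset.fq', 'w') as dst:
#     copy_first_records(src, dst, 200_000_000)
```

Indexing each subset with `SeqIO.index_db()` and bisecting on the record count might show whether the failure depends on a particular record or simply on the size of the file.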

Thanks for reading!
