Error running index_db() on a large fastq file: Duplicate key? UNIQUE constraint failed: offset_data.key · biopython/biopython#2038


Labels: Bug, help wanted

Description

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows.

The script is running in a singularity container, which is set up as follows:

>>> import sys; print(sys.version)
3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-Ubuntu-19.04-disco
>>> import Bio; print(Bio.__version__)
1.73

The script is being called from a snakemake workflow, on a host configured as follows (biopython is not installed on the host):

>>> import sys; print(sys.version)
3.6.3 (default, Jan  9 2018, 10:19:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-redhat-7.6-Maipo

Expected behaviour

I'm trying to index a large fastq file with 613,974,956 records using SeqIO.index_db().

Actual behaviour

index_db() fails with an error about duplicate read IDs, but I'm sure the file contains no duplicate identifiers. I've tried replacing the standard fastq identifier (e.g. @MG00HS20:1017:CAK56ANXX:6:1101:12529:2055 1:N:0:) with a single integer, but the error still occurs.

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 732, in _build_index
    con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/archive/deardenlab/tomharrop/projects/racon-chunks/.snakemake/scripts/tmpliab3waj.index_reads.py", line 14, in <module>
    'fastq')
  File "/usr/local/lib/python3.7/dist-packages/Bio/SeqIO/__init__.py", line 1032, in index_db
    key_function, repr)
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 563, in __init__
    self._build_index()
  File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 738, in _build_index
    raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
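
For context, here is a minimal sketch of how SQLite produces this message, using only the standard library. The table name offset_data and the column key come from the error text above; the other columns and the index name key_index are assumptions made for illustration and may not match the schema that Bio/File.py actually creates. It shows that building a unique index over rows that already contain a repeated key raises sqlite3.IntegrityError, which is the exception the traceback reports.

```python
import sqlite3


def unique_index_error(keys):
    """Insert keys, then build a unique index; return the error text or None."""
    con = sqlite3.connect(":memory:")
    # "offset_data" and "key" follow the error message; the remaining
    # columns are assumed for illustration only.
    con.execute("CREATE TABLE offset_data (key TEXT, offset INTEGER, length INTEGER)")
    con.executemany(
        "INSERT INTO offset_data (key, offset, length) VALUES (?, ?, ?)",
        [(key, position, 0) for position, key in enumerate(keys)],
    )
    try:
        # The index is created after the rows are loaded, so any repeated
        # key is only detected at this point.
        con.execute("CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key)")
    except sqlite3.IntegrityError as err:
        return str(err)
    finally:
        con.close()
    return None


print(unique_index_error(["read1", "read2"]))  # distinct keys
print(unique_index_error(["read1", "read1"]))  # repeated key
```

If this matches what index_db() does, the error would mean that SQLite saw at least two rows with an identical key when the index was built. The sketch cannot show whether those rows come from the input file or from the indexing code.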

Steps to reproduce

#!/usr/bin/env python3

from Bio import SeqIO

read_file = 'path/to/reads.fq'
db_file = 'path/to/reads.idx'

read_index = SeqIO.index_db(db_file,
                            read_file,
                            'fastq')

I don't know if it's relevant, but I've tried indexing the first 100,000,000 records and it worked. I don't see how to share a reproducible data set for this.

Thanks for reading!

Contributor guide

Tech stack: python
Domain: data, backend
Issue type: bug
Difficulty: 4
Estimated time: 3-5 days
Activity status: active
Clarity: needs investigation
Prerequisites: Python, basic SQLite, Biopython basics
Beginner friendliness: 20
Research direction: Investigate _build_index in Bio/File.py to understand how keys are generated and stored in the SQLite database. The error reports a UNIQUE constraint violation on offset_data.key, which would occur if two different records produced the same key, for example through a hash collision or key truncation. Check the key function and the SQLite schema to confirm that keys stay unique, and review the comments on the issue for additional insights.
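
One way to check the keys independently of index_db() is to load every read identifier into a plain SQLite table with no uniqueness constraint and ask SQLite which ones repeat. The sketch below does this with the standard library only. It assumes a four-line-per-record FASTQ file (no wrapped sequences) and takes the first whitespace-delimited word of each title line as the key, which is my understanding of the default key for FASTQ when no key_function is given. The names fastq_ids, find_duplicate_ids, and the ids table are hypothetical helpers, not part of Biopython.

```python
import sqlite3


def fastq_ids(handle):
    """Yield the first word of each title line in a 4-line-per-record FASTQ."""
    for title in handle:
        if not title.strip():
            continue  # tolerate blank lines between or after records
        if not title.startswith("@"):
            raise ValueError("Expected a FASTQ title line, got %r" % title)
        # Skip the sequence, '+' separator, and quality lines.
        next(handle, "")
        next(handle, "")
        next(handle, "")
        words = title[1:].split()
        yield words[0] if words else ""


def find_duplicate_ids(handle, db_path=":memory:"):
    """Return (key, count) pairs for identifiers that occur more than once."""
    con = sqlite3.connect(db_path)
    try:
        # No UNIQUE constraint here, so repeated keys load without error.
        con.execute("CREATE TABLE ids (key TEXT)")
        con.executemany(
            "INSERT INTO ids (key) VALUES (?)",
            ((key,) for key in fastq_ids(handle)),
        )
        con.commit()
        return con.execute(
            "SELECT key, COUNT(*) FROM ids "
            "GROUP BY key HAVING COUNT(*) > 1 ORDER BY key"
        ).fetchall()
    finally:
        con.close()
```

For a file with hundreds of millions of records, pass a db_path on disk (a new, not yet existing file) instead of the in-memory default, for example find_duplicate_ids(open('path/to/reads.fq'), 'path/to/check.sqlite'). An empty result would suggest the duplicate arises inside the indexing step rather than in the input.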