Fixes involving CRAM and header API for long references.#976
Open
jkbonfield wants to merge 3 commits intosamtools:developfrom
Open
Fixes involving CRAM and header API for long references.#976jkbonfield wants to merge 3 commits intosamtools:developfrom
jkbonfield wants to merge 3 commits intosamtools:developfrom
Conversation
We can now have many cigar ops (eg ref-skips) that combine to more than 512Mb. This isn't a full fix for long references as CRAM sequence positions are still 32-bit. However this is a starting point down that road and there are still sizes between 28-bit and 32-bit where the CIGAR failed to operate correctly.
The h->sdict hash is used to track references that are > 4Gb in size. The dup code didn't copy this. This manifested itself as CRAM SQ headers being truncated (read SAM hdr, dup, write as CRAM hdr). To fix this a function was written that creates or updates the sdict from the hrecs parsed header structs. It's possible this should be called directly from the sam_hdr_create function (part of the SAM format parser) instead of manually keeping track of sdict itself, however doing so would require initialising the new header structs so I haven't done this. This is a general utility, so perhaps should be made a public part of the header API. However IMO the new header API should hide this nuance away and just return the correct data, also ensuring that header updates work correctly and honour the text form. Since c83c9e2 the header API also was using the 32-bit capped target_len in preference to the parsed text from SQ LN fields when they differed. I am assuming this was a decision in what takes priority in BAM where the sequence names and lengths exist in both text and binary form. This commit reverses this and makes the text form always take priority. As this is at least required in some scenarios (long references) it seems easier to simply make it apply in all scenarios.
a914023 to
3c079af
Compare
valeriuo
reviewed
Nov 18, 2019
sam.c
Outdated
| } | ||
|
|
||
| if (h0->sdict) { | ||
| // We have a reason to use new API, so build it so we |
Contributor
There was a problem hiding this comment.
@jkbonfield As discussed, this would be better addressed inside sam_hdr_update_target_arrays.
…h `target_name` and `target_len`. Rename `rebuild_target_arrays` to `sam_hdr_rebuild_target_arrays`. Remove unnecessary external condition on `hrecs->refs_changed`. Rearrange the position of `sam_hrecs_rebuild_text`, to reflect its internal usage.
Contributor
Author
|
Moving the code to |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is obviously only partial because we need CRAM4 for long reads, but it's now better at reporting errors and also now copes correctly with the cases inbetween 28-bit and 32-bit sizes for long cigar ops.
The header API has also had some fixes, which were discovered in the process of updating CRAM. The main culprit was sam_hdr_dup which didn't honour h->sdict.