Writer

The CBAM writing process is managed by the FileWriter class. It works with RawReadBlob objects from the BioD experimental reader.

Structure of FileWriter:

class FileWriter
{
    File file;
    FileMeta fileMeta;
    BamHeader bamHeader;
    int rowGroupSize;
    RawReadBlob[] buffer;
    int numRows;
    int totalRows;

    this(File fn, BamHeader bamHead){...}
    this(File fn, BamHeader bamHead, int groupSize){...}

    ~this(){...}
    void close(){...}

    void addRecord(RawReadBlob record){...}

    void flushBuf(){...}

    void writeRowGroup(RawReadBlob[] recordBuf, uint num_rows){...}

    void writeFieldToBuf(ubyte[] buf, ColumnTypes columnType, RawReadBlob readBlob, int offset){...}

    void writeVarFieldToBuf(ubyte[] buf, ColumnTypes columnType, RawReadBlob readBlob, ref int offset){...}

    uint calcBufSize(ColumnTypes columnType, RawReadBlob[] recordBuf){...}

    ulong writeColumn(ubyte[] column){...}

    void writeMeta(){...}

    static void writeBamHeader(BamHeader bamHeader, File file){...}

    static void writeToFile(T)(T obj, File file, ubyte[] buf){...}
}

## Description of the fields/methods from top to bottom


File file;

FileWriter uses the File object passed to its constructor. All interactions with the file happen through this File object.


FileMeta fileMeta;

The FileMeta struct holds meta information about the CBAM file. It consists of an array of RowGroupMeta structs:

struct FileMeta
{
    RowGroupMeta[] rowGroups;
}

The RowGroupMeta struct holds meta information about one row group of the CBAM file: offsets of the column chunks inside the file, sizes of the column chunks, etc.

struct RowGroupMeta
{
    ulong[EnumMembers!ColumnTypes.length] columnsOffsets;
    ulong[EnumMembers!ColumnTypes.length] columnsSizes;
    ulong[EnumMembers!ColumnTypes.length] uncompressedColSizes;
    ulong total_byte_size;
    uint num_rows;
}

The first three fields are arrays containing meta information for the column chunks inside this row group, one entry per column.

total_byte_size contains the row group size in bytes. num_rows contains the number of rows in this row group.

EnumMembers!ColumnTypes.length is the number of elements in the ColumnTypes enum:

enum ColumnTypes
{
    _refID,
    _pos,
    _blob_size,
    _bin_mq_nl,
    _flag_nc,
    sequence_length,
    _next_refID,
    _next_pos,
    _tlen,
    read_name,
    raw_cigar,
    raw_sequence,
    raw_qual,
    raw_tags
}

It represents the columns (originally fields of BAM records) contained in a CBAM file. The enum values describe what type of BAM information a particular column of the CBAM file carries. Note that the order of the fields in this enum matters: the writer treats every value declared before read_name as a fixed-size field.
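The dependence on declaration order can be checked directly. The following is a minimal sketch, not code from the CBAM sources: the helpers isFixedSize and countFixedSize are hypothetical, and the enum is repeated only to keep the example self-contained.

```d
import std.stdio : writeln;
import std.traits : EnumMembers;

enum ColumnTypes
{
    _refID, _pos, _blob_size, _bin_mq_nl, _flag_nc, sequence_length,
    _next_refID, _next_pos, _tlen,
    read_name, raw_cigar, raw_sequence, raw_qual, raw_tags
}

// Hypothetical helper: a column is fixed size if it is declared before read_name.
bool isFixedSize(ColumnTypes c)
{
    return c < ColumnTypes.read_name;
}

// Hypothetical helper: counts the fixed-size columns by walking the enum members.
size_t countFixedSize()
{
    size_t n = 0;
    foreach (c; EnumMembers!ColumnTypes)
    {
        if (isFixedSize(c))
            ++n;
    }
    return n;
}

void main()
{
    // The enum has 14 members in total.
    static assert(EnumMembers!ColumnTypes.length == 14);
    writeln(countFixedSize()); // prints 9
}
```

Moving a variable-size member above read_name, or a fixed-size member below it, would silently change which branch of the writer handles it.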


BamHeader bamHeader;

Holds original BAM file header.


int rowGroupSize;

The maximum number of records in one row group.


RawReadBlob[] buffer;

Accumulates RawReadBlobs (BAM records) before they are processed and written to the file.


int numRows;

Holds the number of valid records inside the buffer. After FileWriter flushes the buffer to the file, numRows is set to 0, which invalidates all records inside the buffer.


int totalRows;

Holds the number of written records.


this(File fn, BamHeader bamHead)
{
    // Forward the bamHead parameter; the bamHeader field is not set yet at this point.
    this(fn, bamHead, DEFAULT_SIZE);
}

this(File fn, BamHeader bamHead, int groupSize)
{
    rowGroupSize = groupSize;
    buffer.length = rowGroupSize;
    file = fn;
    file.rawWrite(CBAM_MAGIC);
    bamHeader = bamHead;
    numRows = 0;
    totalRows = 0;
}

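Constructors. The two-argument form delegates to the three-argument form with DEFAULT_SIZE as the row group size. The three-argument form allocates the record buffer, writes CBAM_MAGIC at the beginning of the file and stores the BAM header.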


~this()
{
    close();
}

void close()
{
    if (!file.isOpen)
        return;
    flushBuf();
    writeMeta();
    file.close();
}

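Destructor and close function. close() may also be called manually: it flushes the remaining buffered records, writes the file meta and closes the file. If the file is already closed it returns immediately, so the call from the destructor after a manual close does nothing.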


void addRecord(RawReadBlob record)
{
    if (numRows == rowGroupSize)
    {
        flushBuf();
        numRows = 0;
    }
    buffer[numRows++] = record;
}

Adds a record to the CBAM file. When the buffer is full (numRows == rowGroupSize), it is flushed to the file automatically before the new record is stored.
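The buffering behaviour can be illustrated without any BAM types. This is only a sketch under stated assumptions: GroupingBuffer and groupRecords are hypothetical names, plain ints stand in for RawReadBlob records, and an in-memory array of groups stands in for the row groups written to the file. Unlike the original flushBuf, the flush here resets numRows itself and skips an empty buffer.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the FileWriter buffering logic.
struct GroupingBuffer
{
    int[] buffer;   // fixed-capacity record buffer
    size_t numRows; // number of valid records in buffer
    int[][] groups; // stands in for row groups written to the file

    this(int groupSize)
    {
        buffer.length = groupSize;
    }

    void add(int record)
    {
        // Flush first when the buffer is full, then store the new record.
        if (numRows == buffer.length)
            flush();
        buffer[numRows++] = record;
    }

    void flush()
    {
        if (numRows == 0)
            return;
        // Copy the valid prefix, because the buffer is reused afterwards.
        groups ~= buffer[0 .. numRows].dup;
        numRows = 0;
    }
}

// Hypothetical helper: feeds all records through the buffer and returns the groups.
int[][] groupRecords(int[] records, int groupSize)
{
    auto gb = GroupingBuffer(groupSize);
    foreach (r; records)
        gb.add(r);
    gb.flush(); // final partial group, as close() does via flushBuf()
    return gb.groups;
}

void main()
{
    writeln(groupRecords([1, 2, 3, 4, 5, 6, 7], 3)); // prints [[1, 2, 3], [4, 5, 6], [7]]
}
```

Seven records with a group size of three end up in two full groups and one partial group, which mirrors how the last row group of a file may hold fewer than rowGroupSize records.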


void flushBuf()
{
    writeRowGroup(buffer[0..numRows], numRows); 
}

Calls writeRowGroup to write only the valid records (buffer[0..numRows]) to the file. It does not reset numRows itself; addRecord does that after flushing.


void writeRowGroup(RawReadBlob[] recordBuf, uint num_rows){...}

Manages writing a formed row group to the file: fills the byte buffer and writes it to the file.

The following code chunks are from this function.

RowGroupMeta rowGroupMeta;
rowGroupMeta.num_rows = num_rows;
totalRows += numRows;

uint total_size = 0;

Prepares the row group meta struct and the size counters.

ubyte[] buf;
buf.length = num_rows * int.sizeof;

Byte buffer. It holds the bytes extracted from the fields of the BAM reads before they are written. It is initialized to the length required to hold one column of fixed-size values for the whole row group: every fixed-size BAM record field is 4 bytes, so num_rows * int.sizeof bytes can be preallocated.

Note that, to avoid reallocation, the byte buffer is filled with the fixed-size fields first, since they all occupy the same space, and only then with the variable-size fields.

foreach (columnType; EnumMembers!ColumnTypes){...}

The foreach loop manages filling the byte buffer. It iterates over the previously defined ColumnTypes enum. The enum values are ordered so that the byte buffer is not reallocated until the variable-size fields are reached: the first nine values represent fixed-size fields, and they are processed first.

foreach (columnType; EnumMembers!ColumnTypes)
{
    if (columnType < ColumnTypes.read_name)
    {
        rowGroupMeta.columnsOffsets[columnType] = file.tell(); // line 1
        for (int i = 0; i < num_rows; ++i)
        {
            writeFieldToBuf(buf, columnType, recordBuf[i], i * simple_field_size);
        }
        rowGroupMeta.uncompressedColSizes[columnType] = buf.length;
        rowGroupMeta.columnsSizes[columnType] = writeColumn(
                buf[0 .. num_rows * simple_field_size]);
    }
    else
    {...}
}

All enum values before ColumnTypes.read_name represent fixed-size fields, so the comparison columnType < ColumnTypes.read_name selects them.

rowGroupMeta.columnsOffsets[columnType] = file.tell();

Saves the position in the file where the column chunk begins. Note that there have been reports that file.tell() may return wrong values.

for (int i = 0; i < num_rows; ++i)
{
    writeFieldToBuf(buf, columnType, recordBuf[i], i * simple_field_size);
}

Extracts the record field corresponding to the specified column and writes it to the byte buffer at offset i * simple_field_size.
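The fixed-size layout can be sketched in isolation. The example below is an illustration, not the CBAM implementation: packFixedColumn is a hypothetical helper, plain ints stand in for one field of each record, and simple_field_size is assumed to equal int.sizeof (the text above states that every fixed-size field is 4 bytes).

```d
import std.bitmanip : write;
import std.stdio : writeln;
import std.system : Endian;

// Assumption: the writer's simple_field_size is 4 bytes.
enum simple_field_size = int.sizeof;

// Hypothetical helper: packs one fixed-size column into a preallocated buffer.
ubyte[] packFixedColumn(const int[] values)
{
    // One 4-byte slot per row, allocated once.
    auto buf = new ubyte[](values.length * simple_field_size);
    foreach (i, v; values)
    {
        // Row i always lands at byte offset i * simple_field_size.
        buf.write!(int, Endian.littleEndian)(v, i * simple_field_size);
    }
    return buf;
}

void main()
{
    writeln(packFixedColumn([1, 258])); // prints [1, 0, 0, 0, 2, 1, 0, 0]
}
```

Because the offset depends only on the row index, the same buffer can be reused for every fixed-size column of the row group.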

rowGroupMeta.uncompressedColSizes[columnType] = buf.length;
rowGroupMeta.columnsSizes[columnType] = writeColumn(
        buf[0 .. num_rows * simple_field_size]);

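Saves the uncompressed and compressed column chunk sizes. The uncompressed size is the byte buffer length; the compressed size is the value returned by writeColumn.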

foreach (columnType; EnumMembers!ColumnTypes)
{
    if (columnType < ColumnTypes.read_name)
    {...}
    else
    {
        buf.length = calcBufSize(columnType, recordBuf) + int.sizeof * num_rows;
        rowGroupMeta.columnsOffsets[columnType] = file.tell();

        int currentPos = 0;
        for (int i = 0; i < num_rows; ++i)
        {
            writeVarFieldToBuf(buf, columnType, recordBuf[i], currentPos);
        }
        rowGroupMeta.uncompressedColSizes[columnType] = buf.length;
        rowGroupMeta.columnsSizes[columnType] = writeColumn(buf[0 .. currentPos]);
    }
}

Writes the variable-size fields. First calculates the byte buffer length needed to hold the data. Unlike the fixed-size branch, it keeps currentPos to track the offset in the byte buffer: since the fields are of variable size, the offset cannot be computed from the record index. writeVarFieldToBuf takes the offset by ref and advances it.
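A running offset of this kind can be sketched as follows. This is a guess at the layout, not the actual writeVarFieldToBuf: the extra int.sizeof * num_rows bytes in the buffer length suggest a 4-byte length prefix per field, and the sketch assumes exactly that, written little-endian. writeVarField and packVarColumn are hypothetical names, and strings stand in for the raw field bytes.

```d
import std.bitmanip : write;
import std.stdio : writeln;
import std.string : representation;
import std.system : Endian;

// Hypothetical helper: writes a length prefix and the field bytes, advancing offset.
void writeVarField(ubyte[] buf, const(ubyte)[] field, ref int offset)
{
    buf.write!(int, Endian.littleEndian)(cast(int) field.length, offset);
    offset += cast(int) int.sizeof;
    buf[offset .. offset + field.length] = field[];
    offset += cast(int) field.length;
}

// Hypothetical helper: packs one variable-size column into a single buffer.
ubyte[] packVarColumn(string[] fields)
{
    // Payload bytes plus one assumed 4-byte length prefix per row.
    size_t payload = 0;
    foreach (f; fields)
        payload += f.length;
    auto buf = new ubyte[](payload + int.sizeof * fields.length);

    int currentPos = 0;
    foreach (f; fields)
        writeVarField(buf, f.representation, currentPos);

    // The running offset must end exactly at the end of the buffer.
    assert(currentPos == buf.length);
    return buf;
}

void main()
{
    writeln(packVarColumn(["ab", "", "c"]));
    // prints [2, 0, 0, 0, 97, 98, 0, 0, 0, 0, 1, 0, 0, 0, 99]
}
```

Computing the total length up front means the buffer is sized once per column, and the final value of currentPos doubles as the number of bytes to hand to the column writer.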

rowGroupMeta.total_byte_size = reduce!((a, b) => a + b)(rowGroupMeta.columnsSizes);
fileMeta.rowGroups ~= rowGroupMeta;

Calculates the total byte size of the row group as the sum of columnsSizes and appends the row group meta to the file meta's array of row groups.
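As a small illustration of that sum, here is a sketch with made-up sizes for three column chunks. totalByteSize is a hypothetical helper, and std.algorithm's sum is used in place of the reduce call from the original code.

```d
import std.algorithm.iteration : sum;
import std.stdio : writeln;

// Hypothetical helper: total size of a row group from its per-column written sizes.
ulong totalByteSize(const(ulong)[] columnsSizes)
{
    return columnsSizes.sum;
}

void main()
{
    // Made-up written sizes of three column chunks, in bytes.
    ulong[3] columnsSizes = [10, 20, 12];
    writeln(totalByteSize(columnsSizes[])); // prints 42
}
```

Since columnsSizes holds the values returned by writeColumn, the total reflects the bytes written to the file rather than the uncompressed buffer sizes.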


void writeFieldToBuf(ubyte[] buf, ColumnTypes columnType, RawReadBlob readBlob, int offset)

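Extracts the bytes of a fixed-size field and saves them to the byte buffer at the given offset. _refID, _pos and _blob_size are serialized with std.bitmanip.write; the remaining fields are copied as raw bytes.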

switch (columnType)
{
    case ColumnTypes._refID:
    {
        std.bitmanip.write!(int, Endian.littleEndian, ubyte[])(buf,
                readBlob.refid, offset);
        break;
    }
    case ColumnTypes._pos:
    {
        std.bitmanip.write!(int, Endian.littleEndian, ubyte[])(buf, readBlob.pos, offset);
        break;
    }
    case ColumnTypes._blob_size:
    {
        uint blob_size = cast(int) readBlob._data.length;
        std.bitmanip.write(buf, blob_size, offset);
        break;
    }
    case ColumnTypes._bin_mq_nl:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_bin_mq_nl;
        break;
    }
    case ColumnTypes.sequence_length:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_sequence_length;
        break;
    }
    case ColumnTypes._flag_nc:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_flag_nc;
        break;
    }
    case ColumnTypes._next_pos:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_next_pos;
        break;
    }
    case ColumnTypes._next_refID:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_next_refID;
        break;
    }
    case ColumnTypes._tlen:
    {
        buf[offset .. offset + simple_field_size] = readBlob.raw_tlen;
        break;
    }
    default:
    {
        assert(false, "No such type exists");
    }
}
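One detail of the switch above is worth checking when writing a reader: std.bitmanip.write uses big-endian byte order when no endianness is given, which is how the _blob_size case calls it, while the _refID and _pos cases pass Endian.littleEndian explicitly. The sketch below only demonstrates that difference; encodeDefault and encodeLittle are hypothetical helpers, and whether the mixed byte order is intended is not stated in the source.

```d
import std.bitmanip : write;
import std.stdio : writeln;
import std.system : Endian;

// Hypothetical helper: encodes with the default byte order of std.bitmanip.write.
ubyte[] encodeDefault(uint value)
{
    auto buf = new ubyte[](uint.sizeof);
    buf.write(value, 0); // no endianness given: big-endian
    return buf;
}

// Hypothetical helper: encodes with an explicit little-endian byte order.
ubyte[] encodeLittle(uint value)
{
    auto buf = new ubyte[](uint.sizeof);
    buf.write!(uint, Endian.littleEndian)(value, 0);
    return buf;
}

void main()
{
    writeln(encodeDefault(1)); // prints [0, 0, 0, 1]
    writeln(encodeLittle(1));  // prints [1, 0, 0, 0]
}
```

A reader therefore has to decode each column with the same byte order the writer used for it.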