How to ignore top rows in CSV or TSV files

When working with CSV or TSV files, we often need to skip the header row (i.e., the column headers) before reading the rest of the file. Each programming language has its own technique for skipping the first line of a file. This post shows how to do it in Java, Groovy, and Python.

Throughout this post, we assume the data is stored in a CSV file named data.csv.

Introduction of data structure

As you probably know, CSV files hold tabular data, much like spreadsheets in Microsoft Excel or Google Sheets. Our sample file looks like this:

Entity,Qualifiers,Preferred Ontologies,Value,Comments/Metadata,DOME
Model,bqbiol:hasProperty,Experimental Factor Ontology EFO,https://identifiers.org/EFO:0011061,Toxicity,
Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C61408,synthetic accessibility ,
Model,bqbiol:hasProperty,BioAssay Ontology BAO,https://identifiers.org/BAO:0000009,ADMET,
Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C45329,Active,
Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C154407,Inactive,
Model,bqbiol:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_0154,Small molecules,
Model,bqbiol:hasProperty,BioAssay Ontology BAO,https://identifiers.org/BAO:0002305,QSAR,
Model,bqbiol:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_3336,Drug discovery,
Model,bqmodel:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_3474,Machine learning,
Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C16309,Artificial Intelligence,
Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C176258,Deep Learning,
Model,bqmodel:hasProperty,Ontology for MIRNA Target OMIT,https://identifiers.org/OMIT:0004946,Decision Tree,Optimization-Algorithm
Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C53237,Regression Method,Optimization-Algorithm
Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C17429,Neural network,Optimization-Algorithm
Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000415,accuracy,Evaluation-Performance Measure
Model,bqmodel:hasProperty,OBCS: Ontology of Biological and Clinical Statistics OBCS,http://identifiers.org/OBCS:0000058,sensitivity,Evaluation-Performance Measure
Model,bqmodel:hasProperty,STATO: the statistical methods ontology,http://identifiers.org/STATO:0000053,false positive rate,Evaluation-Performance Measure
Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000524,Matthews correlation coefficient,Evaluation-Performance Measure
Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000274,AUC–ROC,Evaluation-Performance Measure
Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000037,mean squared error (MSE),Evaluation-Performance Measure
Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C65172,Pearson correlation coefficient (PCC),Evaluation-Performance Measure
Model,bqmodel:hasProperty,Ontology for Biomedical Investigations OBI,https://identifiers.org/obi:OBI_0200032,5-fold cross-validation,Evaluation-Method
Model,bqmodel:isDescribedBy,Ersilia Model Hub,https://github.com/ersilia-os/eos92sw,Ersilia Incorporation URL,Model-Executable form
Model,bqmodel:isDescribedBy,GitHub Repository,https://github.com/pulimeng/eToxPred,eToxPred Source Code,Model-Source code
Model,bqmodel:isDescribedBy,PubMed Identification Number PMID,https://identifiers.org/pubmed:30621790,PubMed URL,
Model,bqbiol:hasInput,Chemical information ontology (cheminf),https://identifiers.org/CHEMINF:000018,Smiles descriptors,Data - Input
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://nubbe.iq.unesp.br/portal/nubbe-search.html,NuBBE Dataset,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://pkuxxj.pku.edu.cn/UNPD,Universal Natural Products Database (UNPD),Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://dudez.docking.org/,"Data-base of Useful Decoys, Extended (DUD-E)",Data - Source
Model,bqbiol:hasDataset,BioAssay Ontology BAO,https://identifiers.org/BAO:0700004,FDA-approved drugs,Data - Source
Model,bqbiol:hasDataset,Molecular Interactions oOntology MI,https://identifiers.org/MI:0470,Kyoto Encyclopedia of Genes and Genomes (KEGG) Compound,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://web.archive.org/web/20191001095455/https://toxnet.nlm.nih.gov/,TOXNET Database,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://www.t3db.ca/,"Toxin and Toxin Target Database (T3DB),",Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://tcm.cmu.edu.tw/,Traditional Chinese Medicine (TCM) Database,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://files.toxplanet.com/cpdb/index.html,Carcinogenicity Potency (CP) database,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://tox.charite.de/protox3/,SuperToxic database,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Cardiotoxicity (CD) dataset,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Endocrine disruption (ED) dataset,Data - Source
Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Acute oral toxicity (AO) dataset,Data - Source
Model,bqbiol:hasOutput,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C25338,synthetic accessibility (SA) score,Model-output
Model,bqbiol:hasOutput,Experimental Factor Ontology EFO,https://identifiers.org/EFO:0011061,toxicity (Tox) score,Model-output

We will use the content above to exercise all approaches presented in this post.

With Groovy and Java

In Java

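Approach: Read the input line by line with a BufferedReader in a standard while loop and skip the first line. The script is written in Java style but runs as a Groovy script; the dataContext object is provided by the Boomi runtime, as in the example it is adapted from [1].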

//This script strips the first line out of each document and outputs the rest of the document contents unaltered.
newline = System.getProperty("line.separator");

for( int i = 0; i < dataContext.getDataCount(); i++ ) {
  InputStream is = dataContext.getStream(i);
  Properties props = dataContext.getProperties(i);

  reader = new BufferedReader(new InputStreamReader(is));
  outData = new StringBuffer();
  lineNum = 0;

  while ((line = reader.readLine()) != null) {
    // Skip first line
    if (lineNum==0) {
      lineNum++;
      continue;
    }

    outData.append(line);
    outData.append(newline);
  }

  is = new ByteArrayInputStream(outData.toString().getBytes());
  dataContext.storeStream(is, props);
}

In Groovy

Approach: Use FileReader.eachLine with a two-parameter closure. The second parameter is a line counter that starts at 1 by default, so the header is the line where the counter equals 1 [2].

new FileReader('data.csv').eachLine { line, number ->
    if (number == 1) {
        return // continue
    }
    // process the rest of the file from here
    println "$number: $line"
}

With Python

We love working with Python because of its beauty and efficiency.

Using next()

We read the contents of data.csv. Calling next() on the file object consumes the header line, so the loop starts reading from line 2.

Note: If you want to use the header later, store it in a variable with header_line = next(f) or header_line = f.readline(). Either call returns the header line and moves the file position past it.

with open("data.csv") as f:
    next(f)  # skip the header line
    for line in f:
        # each line already ends with a newline
        print(line, end="")
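
The note about keeping the header can be sketched as follows. This is a minimal illustration, not part of the original snippets: the helper name split_header is hypothetical, and an in-memory io.StringIO buffer with two lines from the sample data stands in for the opened data.csv.

```python
import io

def split_header(f):
    """Return (header, data_lines) from an open text file or any line iterator."""
    lines = iter(f)
    # A default of None avoids StopIteration when the input is empty
    header = next(lines, None)
    data_lines = [line.rstrip("\n") for line in lines]
    if header is not None:
        header = header.rstrip("\n")
    return header, data_lines

# Stand-in for open("data.csv"): the header and one row of the sample file
sample = io.StringIO(
    "Entity,Qualifiers,Preferred Ontologies,Value,Comments/Metadata,DOME\n"
    "Model,bqbiol:hasProperty,BioAssay Ontology BAO,https://identifiers.org/BAO:0000009,ADMET,\n"
)
header, data_lines = split_header(sample)
print("header:", header)
print("data lines:", len(data_lines))
```

With a real file, the same helper would be called inside a with open("data.csv") as f: block.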

CSV files can be handled similarly with the csv module, which parses each line into a list of fields as you iterate, without loading the entire file into memory.

import csv

with open("data.csv", newline="") as r:
    rr = csv.reader(r)
    next(rr)  # skip the header row
    for row in rr:
        print(row)
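
Skipping the header through the csv reader rather than through the raw file matters when a header field contains a quoted line break, because the reader treats the whole record as one row. The sketch below illustrates this under assumptions: the helper name read_csv_rows and the two-column sample are made up for the example and do not come from data.csv.

```python
import csv
import io

def read_csv_rows(f):
    """Return (header, rows) parsed by csv.reader from an open text file."""
    reader = csv.reader(f)
    # None is returned instead of raising StopIteration if the file is empty
    header = next(reader, None)
    # list() loads the remaining rows into memory; iterate instead for big files
    return header, list(reader)

# Hypothetical sample: the second header field contains a quoted line break
sample = io.StringIO('Entity,"Comments/\nMetadata"\nModel,Toxicity\n', newline="")
header, rows = read_csv_rows(sample)
print(header)
print(rows)
```

Calling next() on the file object here would consume only the first physical line and leave half of the header in the data.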

 

Using readlines()

We read the contents of data.csv again. readlines() reads the whole file into a list of lines, and the slice [1:] drops index 0, the header, so the result starts at line 2. Slicing generalizes easily: [n:] skips the first n lines. The drawback is memory use: this works fine for small files but can be a problem for large ones, because the whole file is loaded into a list and the slice then builds a second copy.

with open("data.csv") as f:
    # read all lines, then drop the header at index 0
    lines = f.readlines()[1:]

print(lines)
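
To show how the slice extends beyond a single header line, here is a small sketch. The helper name lines_after and the three-line sample are assumptions for illustration; io.StringIO stands in for the opened file.

```python
import io

def lines_after(f, skip=1):
    """Read every line into a list, then drop the first `skip` entries."""
    return f.readlines()[skip:]

# Hypothetical sample standing in for open("data.csv")
sample = io.StringIO("Entity,Value\nModel,Toxicity\nModel,QSAR\n")
remaining = lines_after(sample, skip=2)
print(remaining)
```

Slicing past the end of the list is safe: a skip value larger than the number of lines simply yields an empty list.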

 

Using islice()

We read the contents of data.csv once more. This method uses islice() from the itertools module in the standard library. Here islice() takes three arguments: the iterable to read from (the file), the start position, and the stop position, where None means reading to the end. This is an efficient and Pythonic way of solving the problem, because it skips lines lazily without building a list, and it extends to an arbitrary number of header lines. It works with any iterator over lines, including in-memory file objects such as uploaded files.

from itertools import islice

with open("data.csv") as f:
    for line in islice(f, 1, None):
        print(line, end="")
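
The following sketch shows islice() skipping more than one header line on an in-memory file object. The helper name skip_header and the sample content, including its leading comment line, are hypothetical and only serve the illustration.

```python
import io
from itertools import islice

def skip_header(lines, n_header=1):
    """Lazily yield the items of any iterable after the first n_header ones."""
    return islice(lines, n_header, None)

# Hypothetical in-memory file with two header lines
buffer = io.StringIO("# exported sample\nEntity,Value\nModel,Toxicity\nModel,QSAR\n")
data = [line.rstrip("\n") for line in skip_header(buffer, n_header=2)]
print(data)
```

Because islice() returns an iterator, nothing is read until the loop asks for the next line, which keeps memory use flat for large inputs.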

Voilà!

Summary

These approaches work for any line-oriented text file, not only CSV or TSV. Although we have shown them on a CSV file, you can apply any of them to your own data.
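
For TSV files, one possible variation, not shown in the examples above, is csv.DictReader with a tab delimiter. It reads the first row as field names, so the header never shows up among the data rows. The helper name read_tsv_records and the sample are assumptions made for this sketch.

```python
import csv
import io

def read_tsv_records(f):
    """Return the data rows of a TSV file as dicts keyed by the header names."""
    # DictReader reads the first row as field names, so it never appears as data
    return list(csv.DictReader(f, delimiter="\t"))

# Hypothetical two-column TSV sample, standing in for an opened .tsv file
sample = io.StringIO("Entity\tValue\nModel\tToxicity\nModel\tQSAR\n", newline="")
records = read_tsv_records(sample)
print(records[0]["Value"])
```

This trades the plain lists of csv.reader for dictionaries, which is convenient when columns are accessed by name.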

References

[1] How to Remove the First Line of a Document using Groovy, https://community.boomi.com/s/article/howtoremovefirstlineofadocumentusinggroovy, accessed on June 7th, 2024.

[2] How do I use the firstLine argument in eachLine, https://stackoverflow.com/questions/2699865/how-do-i-use-the-firstline-argument-in-eachline, accessed on June 7th, 2024.

[3] How to Read a File from Line 2 or Skip the Header Row?, https://www.studytonight.com/python-howtos/how-to-read-a-file-from-line-2-or-skip-the-header-row, accessed on June 7th, 2024.

[4] Read file from line 2 or skip header row, https://stackoverflow.com/questions/4796764/read-file-from-line-2-or-skip-header-row, accessed on June 7th, 2024.
