When working with CSV or TSV files, we often need to skip the header row (i.e., the column headers) before reading the rest of the file. Each programming language has its own technique for skipping the first line of a file. This post shows how to do it in Java, Groovy, and Python.
Throughout this post, we assume that the data is stored in a CSV file named data.csv.
Introduction to the data structure
As you probably know, CSV files resemble other tabular data structures, such as spreadsheets in Microsoft Excel or Google Sheets. The first line holds the column headers, and each following line holds one record.
    Entity,Qualifiers,Preferred Ontologies,Value,Comments/Metadata,DOME
    Model,bqbiol:hasProperty,Experimental Factor Ontology EFO,https://identifiers.org/EFO:0011061,Toxicity,
    Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C61408,synthetic accessibility,
    Model,bqbiol:hasProperty,BioAssay Ontology BAO,https://identifiers.org/BAO:0000009,ADMET,
    Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C45329,Active,
    Model,bqbiol:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C154407,Inactive,
    Model,bqbiol:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_0154,Small molecules,
    Model,bqbiol:hasProperty,BioAssay Ontology BAO,https://identifiers.org/BAO:0002305,QSAR,
    Model,bqbiol:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_3336,Drug discovery,
    Model,bqmodel:hasProperty,Bioinformatics Concept EDAM,https://identifiers.org/bptl/edam:topic_3474,Machine learning,
    Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C16309,Artificial Intelligence,
    Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C176258,Deep Learning,
    Model,bqmodel:hasProperty,Ontology for MIRNA Target OMIT,https://identifiers.org/OMIT:0004946,Decision Tree,Optimization-Algorithm
    Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C53237,Regression Method,Optimization-Algorithm
    Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C17429,Neural network,Optimization-Algorithm
    Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000415,accuracy,Evaluation-Performance Measure
    Model,bqmodel:hasProperty,OBCS: Ontology of Biological and Clinical Statistics OBCS,http://identifiers.org/OBCS:0000058,sensitivity,Evaluation-Performance Measure
    Model,bqmodel:hasProperty,STATO: the statistical methods ontology,http://identifiers.org/STATO:0000053,false positive rate,Evaluation-Performance Measure
    Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000524,Matthews correlation coefficient,Evaluation-Performance Measure
    Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000274,AUC–ROC,Evaluation-Performance Measure
    Model,bqmodel:hasProperty,STATO: the statistical methods ontology,https://identifiers.org/STATO:0000037,mean squared error (MSE),Evaluation-Performance Measure
    Model,bqmodel:hasProperty,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C65172,Pearson correlation coefficient (PCC),Evaluation-Performance Measure
    Model,bqmodel:hasProperty,Ontology for Biomedical Investigations OBI,https://identifiers.org/obi:OBI_0200032,5-fold cross-validation,Evaluation-Method
    Model,bqmodel:isDescribedBy,Ersilia Model Hub,https://github.com/ersilia-os/eos92sw,Ersilia Incorporation URL,Model-Executable form
    Model,bqmodel:isDescribedBy,GitHub Repository,https://github.com/pulimeng/eToxPred,eToxPred Source Code,Model-Source code
    Model,bqmodel:isDescribedBy,PubMed Identification Number PMID,https://identifiers.org/pubmed:30621790,PubMed URL,
    Model,bqbiol:hasInput,Chemical information ontology (cheminf),https://identifiers.org/CHEMINF:000018,Smiles descriptors,Data - Input
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://nubbe.iq.unesp.br/portal/nubbe-search.html,NuBBE Dataset,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://pkuxxj.pku.edu.cn/UNPD,Universal Natural Products Database (UNPD),Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://dudez.docking.org/,"Data-base of Useful Decoys, Extended (DUD-E)",Data - Source
    Model,bqbiol:hasDataset,BioAssay Ontology BAO,https://identifiers.org/BAO:0700004,FDA-approved drugs,Data - Source
    Model,bqbiol:hasDataset,Molecular Interactions oOntology MI,https://identifiers.org/MI:0470,Kyoto Encyclopedia of Genes and Genomes (KEGG) Compound,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://web.archive.org/web/20191001095455/https://toxnet.nlm.nih.gov/,TOXNET Database,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://www.t3db.ca/,"Toxin and Toxin Target Database (T3DB),",Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,http://tcm.cmu.edu.tw/,Traditional Chinese Medicine (TCM) Database,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://files.toxplanet.com/cpdb/index.html,Carcinogenicity Potency (CP) database,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://tox.charite.de/protox3/,SuperToxic database,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Cardiotoxicity (CD) dataset,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Endocrine disruption (ED) dataset,Data - Source
    Model,bqbiol:hasDataset,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C47824,Acute oral toxicity (AO) dataset,Data - Source
    Model,bqbiol:hasOutput,NCI Thesaurus OBO Edition NCIT,https://identifiers.org/NCIT:C25338,synthetic accessibility (SA) score,Model-output
    Model,bqbiol:hasOutput,Experimental Factor Ontology EFO,https://identifiers.org/EFO:0011061,toxicity (Tox) score,Model-output
We will use the content above to exercise all approaches presented in this post.
With Groovy and Java
In Java
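Approach: Read the file line by line with a standard loop and skip the first iteration. The snippet uses Java I/O classes in Java style, but its undeclared variables mean it runs as a Groovy script. It also relies on a dataContext object that the host environment must supply; dataContext is not part of the JDK.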
    // This script strips the first line out of each document and outputs
    // the rest of the document contents unaltered.
    newline = System.getProperty("line.separator");

    for (int i = 0; i < dataContext.getDataCount(); i++) {
        InputStream is = dataContext.getStream(i);
        Properties props = dataContext.getProperties(i);

        reader = new BufferedReader(new InputStreamReader(is));
        outData = new StringBuffer();
        lineNum = 0;

        while ((line = reader.readLine()) != null) {
            // Skip first line
            if (lineNum == 0) {
                lineNum++;
                continue;
            }
            outData.append(line);
            outData.append(newline);
        }

        is = new ByteArrayInputStream(outData.toString().getBytes());
        dataContext.storeStream(is, props);
    }
In Groovy
Approach: Use FileReader.eachLine with a two-parameter closure. The second parameter is a line counter that starts at 1, so the header is the line for which the counter equals 1.
    new FileReader('data.csv').eachLine { line, number ->
        if (number == 1) {
            return // continue
        }
        // process the rest of the file from here
        println "$number: $line"
    }
With Python
We love working with Python because of its beauty and efficiency.
Using next()
We read the contents of data.csv. This method calls next() on the file object to consume the header, so the loop starts reading at line 2.
Note: If you want to use the header later, store it in a variable, either with header_line = next(f) or with header_line = f.readline(). Both calls return the header line and advance the file to line 2.
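    with open("data.csv") as f:
        next(f)  # skip the header line
        for line in f:
            print(line, end="")  # each line already ends with a newline
    # no f.close() needed: the with block closes the file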
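The note about keeping the header can be sketched as follows. This is a minimal illustration, not the only way to do it: the helper name split_header and the in-memory sample are assumptions made for the example, and io.StringIO stands in for an opened data.csv. Passing None as the default to next() avoids a StopIteration error when the input is empty.

```python
import io


def split_header(lines):
    """Return (header, remaining_lines) for any iterable of lines.

    The header is None when the input is empty, because next() is
    called with a default instead of being allowed to raise.
    """
    iterator = iter(lines)
    header = next(iterator, None)
    return header, iterator


# Hypothetical in-memory sample standing in for data.csv
sample = io.StringIO(
    "Entity,Qualifiers,Value\n"
    "Model,bqbiol:hasProperty,Toxicity\n"
    "Model,bqbiol:hasProperty,ADMET\n"
)

header, rest = split_header(sample)
if header is not None:
    print("header:", header.rstrip("\n"))
for line in rest:
    print(line.rstrip("\n"))
```

With a real file, the same helper could be called as split_header(f) inside a with open("data.csv") as f: block.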
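CSV files can be handled similarly with the csv module, which parses each line into a list of fields. The reader is lazy, so the file is not loaded into memory all at once. Calling next() on the reader rather than on the file skips the header correctly even if a quoted header field contains a line break.
    import csv

    # newline="" lets the csv module handle line endings itself
    with open("data.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            print(row)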
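If the header should be used rather than discarded, csv.DictReader is a possible alternative: it consumes the first row itself and uses it as the keys of each record. The sketch below assumes a hypothetical helper name, read_records, and a shortened in-memory sample in place of data.csv.

```python
import csv
import io


def read_records(f):
    """Yield one dict per data row; DictReader reads the header row itself."""
    reader = csv.DictReader(f)
    for row in reader:
        yield row


# Hypothetical three-column sample; the quoted field contains a comma
sample = io.StringIO(
    "Entity,Qualifiers,Value\n"
    'Model,bqbiol:hasDataset,"Data-base of Useful Decoys, Extended (DUD-E)"\n'
)

for record in read_records(sample):
    print(record["Qualifiers"], "->", record["Value"])
```

Because the fields are addressed by column name, the loop never sees the header as a data row, and quoted commas are parsed as part of the field.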
Using readlines()
We read the contents of data.csv. This method calls readlines(), which returns all lines of the file as a list, and then slices that list to drop the header. In the example below, readlines()[1:] starts at index 1 and therefore skips index 0, the header. The approach generalizes easily: readlines()[n:] skips the first n lines. The drawback is that readlines() loads the whole file into memory and the slice builds a second copy of the list, so it works fine for small files but can cause problems for large ones.
    with open("data.csv") as f:
        # skip the header (index 0)
        lines = f.readlines()[1:]
    print(lines)
Using islice()
We read the contents of data.csv. This method imports islice from itertools, a module in Python's standard library. In the call islice(f, 1, None), the first argument is the file to read, the second is the index at which reading starts, and the third is the stop index, where None means "read until the end of the file". An optional fourth argument sets the step. This is an efficient and Pythonic solution because lines are skipped lazily without loading the file into memory, and it extends to an arbitrary number of header lines. It also works for in-memory file-like objects, such as uploaded files, because islice accepts any iterable.
    from itertools import islice

    with open("data.csv") as f:
        for line in islice(f, 1, None):  # start at index 1, no stop index
            print(line, end="")
    # no f.close() needed: the with block closes the file
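The claims about multiple header lines and in-memory files can be illustrated with a short sketch. The helper name skip_lines and the sample content (a comment line followed by a header) are assumptions made for this example; io.StringIO stands in for an uploaded or in-memory file.

```python
import io
from itertools import islice


def skip_lines(lines, n=1):
    """Lazily yield the items of `lines` after skipping the first n."""
    return islice(lines, n, None)


# Hypothetical sample with two leading lines to skip
sample = io.StringIO(
    "# comment line\n"
    "Entity,Value\n"
    "Model,Toxicity\n"
    "Model,ADMET\n"
)

for line in skip_lines(sample, n=2):
    print(line.rstrip("\n"))
```

Since islice only needs an iterable, the same helper also accepts a plain list of strings or an open file object.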
Voilà!
Summary
These approaches are not limited to CSV or TSV files. Although the examples use a CSV file, you can apply any of them to other line-oriented text files.
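For TSV input, one possible sketch is to reuse the csv module with a tab delimiter. The helper name read_tsv_rows and the sample content are assumptions for illustration only.

```python
import csv
import io


def read_tsv_rows(f):
    """Return the data rows of a tab-separated file, without the header row."""
    reader = csv.reader(f, delimiter="\t")
    next(reader, None)  # skip the header; the default avoids an error on empty input
    return list(reader)


# Hypothetical tab-separated sample
sample = io.StringIO("Entity\tValue\nModel\tToxicity\nModel\tADMET\n")
for row in read_tsv_rows(sample):
    print(row)
```

Note that this helper collects all rows into a list, so for large files it may be preferable to iterate over the reader directly.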
References
[1] How to Remove the First Line of a Document using Groovy, https://community.boomi.com/s/article/howtoremovefirstlineofadocumentusinggroovy, accessed on June 7th, 2024.
[2] How do I use the firstLine argument in eachLine, https://stackoverflow.com/questions/2699865/how-do-i-use-the-firstline-argument-in-eachline, accessed on June 7th, 2024.
[3] How to Read a File from Line 2 or Skip the Header Row?, https://www.studytonight.com/python-howtos/how-to-read-a-file-from-line-2-or-skip-the-header-row, accessed on June 7th, 2024.
[4] Read file from line 2 or skip header row, https://stackoverflow.com/questions/4796764/read-file-from-line-2-or-skip-header-row, accessed on June 7th, 2024.