How to merge two GZIP files, while removing the first row from the second file?


By : G Pierre
Date : August 01 2020, 08:00 AM
I have two gzip'd CSV files, each with a header row, and want to concatenate them while dropping the header of the second file. With a minimal change to your code:
code :
# Variant 1: process substitution
function merge_files { 
   cat <(gzcat "$1") <(gzcat "$2" | tail +2) | gzip > "$3"
}

# Variant 2: a plain subshell grouping the two streams (equivalent)
function merge_files { 
   (
      gzcat "$1"
      gzcat "$2" | tail +2    # tail +2 drops the header line of file 2
   ) | gzip > "$3"
}
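For environments without gzcat, the same technique can be sketched in Python; the function name and file arguments here are illustrative, not part of the original answer:

```python
import gzip

def merge_gzip_csv(first, second, out):
    """Concatenate two gzip'd CSV files, keeping only the first file's header."""
    with gzip.open(out, "wt") as dst:
        with gzip.open(first, "rt") as src:
            dst.writelines(src)      # copy file 1 verbatim, header included
        with gzip.open(second, "rt") as src:
            next(src, None)          # skip file 2's header row
            dst.writelines(src)
```

Opening in text mode (`"rt"`/`"wt"`) keeps the line-oriented logic simple; for very large files this streams line by line and never loads a whole file into memory.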



Gzip: merge a set of small files (<64mb) into several larger files (64mb or 128mb)


By : João Edson
Date : March 29 2020, 07:55 AM
After experimenting a bit further, the following two steps do what I want:
code :
zcat small-*.gz | split -d -l2000000 -a 3 - large_   # concatenate, then split into 2M-line chunks
gzip large_*                                         # compress each chunk
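When the coreutils pipeline isn't available, the same two steps can be sketched in Python — a minimal sketch assuming line-oriented gzip inputs, with `regroup_gzip` and the name pattern being illustrative:

```python
import glob
import gzip
from itertools import islice

def regroup_gzip(pattern, prefix, lines_per_chunk=2_000_000):
    """Concatenate the gzip files matching `pattern` and rewrite them
    as gzip'd chunks of at most `lines_per_chunk` lines each."""
    def all_lines():
        for name in sorted(glob.glob(pattern)):
            with gzip.open(name, "rt") as f:
                yield from f
    src = all_lines()
    index = 0
    while True:
        chunk = list(islice(src, lines_per_chunk))
        if not chunk:
            break
        with gzip.open(f"{prefix}{index:03d}.gz", "wt") as out:
            out.writelines(chunk)
        index += 1
```

The generator streams lines across file boundaries, so only one chunk is held in memory at a time; `{index:03d}` mirrors `split -d -a 3`'s three-digit numeric suffixes.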

Linux merge huge gzip files and keep only intersection


By : Terry the Diamond Cr
Date : March 29 2020, 07:55 AM
I'd do it in Python.
Read the main file into memory and make a dict out of it (use name_id as key). Then stream each info.gzip file and extend the information in the dict according to what you find. (Consider what to do if you find information for a line more than once.)
code :
#!/usr/bin/env python3

import gzip
from collections import OrderedDict

mainData = OrderedDict()  # or just {} if order is not important
with open('main.txt') as mainFile:
  pos = None
  for line in mainFile:
    elements = line.split()
    if pos is None:
      pos = elements.index('name_id')
      mainHeaders = elements
    else:
      mainData[elements[pos]] = elements

infoHeaders = None
for infoFileName in [ 'chr1.info.gz', 'chr2.info.gz' ]:
  with gzip.open(infoFileName, 'rt') as infoFile:  # 'rt' decodes to str
    pos = None
    for line in infoFile:
      elements = line.split()
      if pos is None:
        pos = elements.index('rs_id')
        if infoHeaders is None:
          infoHeaders = elements
        elif infoHeaders != elements:
          print("headers in", infoFileName, "mismatch")  # maybe abort?
      else:
        key = elements[pos]
        try:
          mainData[key] += elements
        except KeyError:
          pass  # this key does not exist in main

with gzip.open('main.all.gz', 'wt') as outFile:
  outFile.write(' '.join(mainHeaders + infoHeaders) + '\n')
  for key, value in mainData.items():
    outFile.write(' '.join(value) + '\n')
Example merged output:
number maf effect se pval name_id use pos rs_id a1 a2 a3 a4
34 0.7844 0.2197 0.0848 0.009585 snp1 t 13303 snp1 0 0 0 0
78 0.6655 -0.1577 0.0796 0.04772 snp2 g 10612 snp2 0 0 0 0

Why is seeking from the end of a file allowed for BZip2 files and not Gzip files?


By : user2390150
Date : March 29 2020, 07:55 AM
In simple terms, gzip is a stream compressor: each compressed element depends on the previous one. Seeking would be pointless, because the whole file would have to be decompressed anyway. The authors of gzip.py probably decided it was better to raise an error than to silently decompress the file, so that the user realizes seeking is inefficient.
On the other hand bzip2 is a block compressor, each block is independent.
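A small sketch illustrating the cost of seeking, with behavior hedged to modern CPython 3 (where gzip.GzipFile emulates seek-from-end by decompressing the whole stream rather than raising, unlike the older gzip.py the answer describes); the file names are illustrative:

```python
import bz2
import gzip
import io
import os
import tempfile

def uncompressed_size(path, opener):
    """Seek to the end of a compressed stream and report the position."""
    with opener(path, "rb") as f:
        # For gzip this triggers a full sequential decompression:
        # there is no way to find the end without reading the stream.
        return f.seek(0, io.SEEK_END)

d = tempfile.mkdtemp()
gz_path = os.path.join(d, "demo.gz")
bz_path = os.path.join(d, "demo.bz2")
with gzip.open(gz_path, "wb") as f:
    f.write(b"x" * 500)
with bz2.open(bz_path, "wb") as f:
    f.write(b"x" * 500)

print(uncompressed_size(gz_path, gzip.open))  # uncompressed size, not file size
print(uncompressed_size(bz_path, bz2.open))
```

Both calls succeed here, but for gzip the work is always O(entire stream), whereas bzip2's independent blocks make block-level indexing (and hence cheaper random access) possible in principle.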

Read N files and merge them in a single dataframe, removing the first two column for N-1 files


By : SHIVA KUMAR
Date : March 29 2020, 07:55 AM
It says I shouldn't answer my own question, but since I found a solution and got no working answer, I thought I would share it. Maybe it will be useful for someone else, or it will prompt better answers.
code :
read_MyView_exports <- function(directory, filenames = NULL) {
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the MyView txt exports

    ## 'filenames' is an optional character vector specifying the filenames 
    ## to be read

    ## Return a data frame containing all the files read

    ## make a list of txt files to be read
    if (is.null(filenames)) {
        filenames <- dir(path = directory, pattern ="\\.txt.*?")
    }
    filenames <- paste(directory, filenames, sep="/")    

    ## read first file
    df1 <- read.table(filenames[1],
                      sep="\t",           # data in MyView export are tab-separated
                      colClasses = "character")
    ## eliminate the column with measurement units
    df1[,2]=NULL

    ## when reading the other files, we need to skip the first two columns, since
    ## they contain varnames and units, so we define a useful wrapper for read.table
    read.and.skip <- function(file){
        df <- read.table(file, sep="\t", colClasses = "character")
        df[,-2:-1,drop=FALSE]
    }


    ## read data from the remaining files (the first was read above),
    ## skipping their first two columns
    alldata <- lapply(filenames[-1], read.and.skip)

    ## merge the list of data frames into a single data frame
    data <- do.call("cbind", alldata)
    cbind(df1, data)
}

Why angular-cli webpack in folder dist (-prod) has gzip files as well as not gzip files


By : Ashish Patil
Date : March 29 2020, 07:55 AM
Update: gzip generation has been removed from the Angular CLI because it was confusing to many people; the CLI now outputs only the files you actually use.