How to remove duplicate lines in two large text files by number of appearance?

By : Stefanie Melo
Date : July 30 2020, 02:00 PM
How can you avoid reading 2 × 12 GB into memory at once, but still process all the data?
By loading those 24 GB chunk by chunk and discarding data you no longer need as you go. Because your files are line-based, reading and processing them line by line seems prudent. Holding 4000-ish characters in memory at once shouldn't pose a problem on a modern personal computer. Note that zip stops at the end of the shorter file, so the code below assumes both files have the same number of lines.
code :
with \
        open("A.txt") as a_file, \
        open("B.txt") as b_file, \
        open("AB.txt", "w") as ab_file:
    for a_line, b_line in zip(a_file, b_file):
        # get rid of the line endings, whatever they are
        a_line, = a_line.splitlines()
        b_line, = b_line.splitlines()

        # output the combined content to AB.txt
        print(f"{a_line}\t{b_line}", file=ab_file)


Fastest way to remove duplicate lines in very large .txt files

By : Deepak Khandelwal
Date : March 29 2020, 07:55 AM
You could try a Bloom filter. You may get some false positives, meaning a unique line is occasionally mistaken for a duplicate (though you can get arbitrarily close to 0% at the cost of more processing), but it should be pretty fast, as you don't need to compare lines or even do a log(n) search for each line you see.
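
A minimal Python sketch of the Bloom filter idea, not a tuned implementation. The names BloomFilter and dedupe_lines and the default sizes (8 Mi bits, 7 hash positions) are assumptions chosen for illustration; for a real file, num_bits has to be sized to the expected number of distinct lines, otherwise the false-positive rate climbs and unique lines get dropped.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter backed by a bytearray (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive all bit positions from one 128-bit digest (double hashing).
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        """Set the item's bits; return True if any bit was previously unset."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def dedupe_lines(lines, num_bits=8 * 1024 * 1024, num_hashes=7):
    """Yield each line the filter has not seen; a false positive drops a unique line."""
    seen = BloomFilter(num_bits, num_hashes)
    for line in lines:
        if seen.add_if_new(line.encode("utf-8")):
            yield line
```

Because dedupe_lines is a generator, it can be fed an open file object and written out line by line, so only the bit array stays in memory rather than the lines themselves.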

Remove all lines after a 4-digit number from a large number of .txt files

By : NeedHelpSchool
Date : March 29 2020, 07:55 AM
The regex 1(?:[4-8]\d\d|900)(?:.|[\r\n])+\z selects the text that starts at a number from 1400 to 1900 and runs to the end of the file.
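
As a rough illustration, the same pattern can be applied from Python. This is a sketch under assumptions: the function name strip_from_number is made up for the example, the whole file is assumed to fit in memory as one string, and the end anchor is written \Z, which Python's re module accepts as the end-of-string anchor.

```python
import re

# Same pattern as in the answer, with the end-of-input anchor spelled \Z.
NUMBER_TO_END = re.compile(r"1(?:[4-8]\d\d|900)(?:.|[\r\n])+\Z")


def strip_from_number(text):
    """Delete everything from the first 1400-1900 number to the end of the text."""
    return NUMBER_TO_END.sub("", text, count=1)
```

Like the original regex, this requires at least one character after the number, and it also matches when the digits are part of a longer number.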

How to remove duplicate lines from a large text file efficiently?

By : Ho Nhat Tan
Date : March 29 2020, 07:55 AM
Before you write your data, if it is held in a list or dictionary, you could run a LINQ query and use group by to group all like keys. Then, for each group, write to the output file.
Your question is a little vague as well. Are you creating a new text file every time, and do you have to store the data as text? There are better formats to use, such as XML and JSON.
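
The answer refers to LINQ in C#; the grouping step can be sketched in Python as a rough analogue. The function names group_by_key and format_groups, and the tab-separated output layout, are assumptions made for this example and do not come from the question.

```python
def group_by_key(pairs):
    """Group (key, value) pairs by key, keeping first-seen key order."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def format_groups(groups):
    """Build one output line per key: the key, a tab, then its values."""
    return [
        f"{key}\t{','.join(str(v) for v in values)}\n"
        for key, values in groups.items()
    ]


def write_groups(groups, out_file):
    """Write the formatted lines to an already opened text file."""
    out_file.writelines(format_groups(groups))
```

Each key ends up on exactly one line, so repeated keys in the input no longer produce repeated lines in the output.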

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certai

By : BELOUCH Mustapha
Date : March 29 2020, 07:55 AM
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
code :
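# combine three files, drop duplicate lines, and split the result into files of 1,000,000 lines each
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
# the same, for every file in the current directory
cat * | sort -u | split -l1000000 - outfile_
# additionally drop empty and whitespace-only lines before de-duplicating
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_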
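
If installing Unix tools is not an option, the pipeline can be approximated in Python. This is a sketch under assumptions, not an equivalent of the commands above: the function name combine_dedupe_split and its parameters are invented for the example, all unique lines must fit in memory, sorting is by code point rather than by locale, and the output files get numeric suffixes (outfile_0000, outfile_0001, ...) instead of the aa, ab, ... suffixes that split produces.

```python
from pathlib import Path


def combine_dedupe_split(input_paths, output_prefix,
                         lines_per_file=1_000_000, skip_blank=False):
    """Merge files, keep each distinct line once (sorted), write fixed-size chunks."""
    unique = set()
    for path in input_paths:
        with open(path, encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                if skip_blank and not line.strip():
                    continue  # mirrors the grep -v '^\s*$' step
                unique.add(line)

    ordered = sorted(unique)
    written = []
    for index, start in enumerate(range(0, len(ordered), lines_per_file)):
        out_path = Path(f"{output_prefix}{index:04d}")
        chunk = ordered[start:start + lines_per_file]
        out_path.write_text("".join(f"{line}\n" for line in chunk),
                            encoding="utf-8")
        written.append(out_path)
    return written
```

The function returns the list of files it wrote, which makes it easy to check how many chunks were produced.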

Compare text files in C# and remove duplicate lines

By : P.A.
Date : March 29 2020, 07:55 AM
Here's a simple solution that works for your example files. It doesn't have any error checking for files in a bad format.
code :
using System;
using System.Collections.Generic;

class Program
{
    //one record parsed from either input file
    class entry
    {
        public string origin;
        public string destination;
        public DateTime time;
        public double price;
    }

    static void Main(string[] args)
    {
        List<entry> data = new List<entry>();

        //parse the input files and add the data to a list
        ParseFile(data, args[0], ',');
        ParseFile(data, args[1], '|');

        //sort the list (by price first)
        data.Sort((a, b) =>
        {
            if (a.price != b.price)
                return a.price > b.price ? 1 : -1;
            else if (a.origin != b.origin)
                return string.Compare(a.origin, b.origin);
            else if (a.destination != b.destination)
                return string.Compare(a.destination, b.destination);
            else
                return DateTime.Compare(a.time, b.time);
        });

        //remove duplicates (list must be sorted for this to work)
        int i = 1;
        while (i < data.Count)
        {
            if (data[i].origin == data[i - 1].origin
                && data[i].destination == data[i - 1].destination
                && data[i].time == data[i - 1].time
                && data[i].price == data[i - 1].price)
                data.RemoveAt(i); //same as the previous entry: drop it
            else
                i++;
        }

        //print the results
        for (i = 0; i < data.Count; i++)
            Console.WriteLine("{0}->{1}->{2:yyyy-MM-dd HH:mm}->${3}",
                data[i].origin, data[i].destination, data[i].time, data[i].price);
    }

    private static void ParseFile(List<entry> data, string filename, char separator)
    {
        using (System.IO.FileStream fs = System.IO.File.Open(filename, System.IO.FileMode.Open))
        using (System.IO.StreamReader reader = new System.IO.StreamReader(fs))
        {
            while (!reader.EndOfStream)
            {
                string[] line = reader.ReadLine().Split(separator);
                //only lines with exactly four fields are used
                if (line.Length == 4)
                {
                    entry newitem = new entry();
                    newitem.origin = line[0];
                    newitem.destination = line[1];
                    newitem.time = DateTime.Parse(line[2]);
                    //skip everything up to and including the '$' sign
                    newitem.price = double.Parse(line[3].Substring(line[3].IndexOf('$') + 1));
                    data.Add(newitem);
                }
            }
        }
    }
}