Comment by tzot

9 hours ago

Well, we can use `memoryview` when building the dict, which avoids creating string objects until output time:

    import re, operator
    def count_words(filename):
        with open(filename, 'rb') as fp:
            data= memoryview(fp.read())
        word_counts= {}
        for match in re.finditer(br'\S+', data):
            word= data[match.start(): match.end()]
            try:
                word_counts[word]+= 1
            except KeyError:
                word_counts[word]= 1
        word_counts= sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            print(word.tobytes().decode(), count)

We could also use `mmap.mmap`.
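
A rough sketch of what the `mmap.mmap` variant might look like. The name `count_words_mmap` and the use of `Counter` are my own choices for illustration, not something from the code above. Note the trade-off: this version copies each word out as `bytes` via `match.group()`, so it saves the one big `fp.read()` rather than the per-word objects, and like the bytes regex above it only knows ASCII whitespace.

```python
import mmap
import re
from collections import Counter


def count_words_mmap(filename):
    """Count whitespace-separated byte tokens straight from a memory map."""
    counts = Counter()
    with open(filename, 'rb') as fp:
        # Caveat: mmap.mmap raises ValueError when the file is empty.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re accepts any bytes-like object, including an mmap.
            for match in re.finditer(br'\S+', mm):
                # group() returns a bytes copy, so no reference into the
                # mapping survives once the mmap is closed.
                counts[match.group()] += 1
    return counts
```

Output would then be something like `for word, count in count_words_mmap(path).most_common(): print(word.decode(), count)`.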

7 comments

akx 7 hours ago

This doesn't do the same thing though, since it's not Unicode aware.

    >>> 'x\u2009   a'.split()
    ['x', 'a']
    # incorrect; in bytes mode, `\S` doesn't know about unicode whitespace
    >>> list(re.finditer(br'\S+', 'x\u2009   a'.encode()))
    [<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(7, 8), match=b'a'>]
    # correct, in unicode mode
    >>> list(re.finditer(r'\S+', 'x\u2009   a'))
    [<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(5, 6), match='a'>]

contravariant 7 hours ago
There's bound to be a way to turn a stream of bytes into a stream of Unicode code points (at least I think that's what Python is doing for strings). Though I'm explicitly not volunteering to write the code for it.
- est 4 hours ago
  
      import mmap, codecs
      from collections import Counter

      def word_count(filepath):
          freq = Counter()
          decode = codecs.getincrementaldecoder('utf-8')().decode
          with open(filepath, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
              for chunk in iter(lambda: mm.read(65536), b''):
                  freq.update(decode(chunk).split())
              freq.update(decode(b'', final=True).split())
          return freq
- zahlman 4 hours ago
  
  Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.
  ... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.
  In fact, it looks as though the entire data structure (whether a dict, Counter etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
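
One caveat with decoding in fixed-size chunks and calling `.split()` on each piece: the incremental decoder takes care of multi-byte characters that straddle a chunk boundary, but a word that straddles one would still be counted as two fragments. A minimal sketch of one way around that is to carry the unfinished last word over to the next chunk. The names `word_count_chunked`, `chunk_size` and `tail` are hypothetical, and it reads the file with plain `f.read()` rather than `mmap` just to keep the example short.

```python
import codecs
from collections import Counter


def word_count_chunked(filepath, chunk_size=65536):
    """Unicode-aware word count that reads and decodes the file chunk by chunk."""
    freq = Counter()
    decode = codecs.getincrementaldecoder('utf-8')().decode
    tail = ''  # possibly unfinished word carried over from the previous chunk
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            text = tail + decode(chunk)
            words = text.split()
            # If the text does not end in whitespace, its last word may
            # continue in the next chunk, so hold it back for now.
            if words and not text[-1].isspace():
                tail = words.pop()
            else:
                tail = ''
            freq.update(words)
    # Flush the decoder and count whatever is still pending.
    freq.update((tail + decode(b'', final=True)).split())
    return freq
```

Since `str.split()` and `str.isspace()` use the same notion of whitespace, U+2009 is treated as a separator here just as in the `'x\u2009   a'.split()` example.
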
est 4 hours ago
OP's `.split_ascii()` doesn't handle U+2009 either.
edit: OP's fully native C++ version using Pystd
- zahlman 4 hours ago
  
  Hmm? Which code are you looking at?

contravariant 7 hours ago

For reasons I never quite understood, Python has a `collections.Counter` for the purpose of counting things. It's a bit cleaner.
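
To illustrate, a minimal sketch of the word count with `Counter` in place of the `try`/`except KeyError` dance. It assumes the file is UTF-8 text and small enough to read in one go; the name `count_words_counter` is made up for this example.

```python
from collections import Counter


def count_words_counter(filename):
    """Return a Counter of whitespace-separated words in a UTF-8 text file."""
    with open(filename, encoding='utf-8') as fp:
        # Counter accepts any iterable and tallies its items.
        return Counter(fp.read().split())
```

`Counter.most_common()` then gives the (word, count) pairs sorted by descending count, so the separate `sorted(..., key=operator.itemgetter(1), reverse=True)` step is not needed either.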