Comment by tzot
11 hours ago
Well, we can use memoryview for the dict generation avoiding creation of string objects until the time for the output:
import re, operator
def count_words(filename):
with open(filename, 'rb') as fp:
data= memoryview(fp.read())
word_counts= {}
for match in re.finditer(br'\S+', data):
word= data[match.start(): match.end()]
try:
word_counts[word]+= 1
except KeyError:
word_counts[word]= 1
word_counts= sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
for word, count in word_counts:
print(word.tobytes().decode(), count)
We could also use `mmap.mmap`.
This doesn't do the same thing though, since it's not Unicode aware.
There's bound to be a way to turn a stream of bytes into a stream of unicode code points (at least I think that's what python is doing for strings). Though I'm explicitly not volunteering to write the code for it.
Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.
... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.
In fact, it looks as though the entire data structure (whether a dict, Counter etc.) should a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
OP's .split_ascii() doesn't handle U+2009 as well.
edit: OP's fully native C++ version using Pystd
Hmm? Which code are you looking at?
For reasons I never quite understood python has a collections.Counter for the purpose of counting things. It's a bit cleaner.