Comment by JSeiko

15 hours ago

Hi! I'm one of the programmers at Gutenberg. We've been improving the site a lot over the past few months (and more is coming!). If you haven't visited the page recently, it's worth checking out again: https://www.gutenberg.org/

60 comments

svat 11 hours ago

Have you considered having a detailed version history for each book (etext)? The process of submitting fixes to typos etc. in books involves sending an email (https://www.gutenberg.org/help/errata.html), and although the last time I did this (2011) the fixes did get applied reasonably quickly (a couple of days), it all felt a bit opaque. The version history could also include the project (usually PGDP, correct?) the etext originated from; that way one would be able to compare against the actual page scans.

I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.

gluejar 10 hours ago

We're using git repos internally to keep history for each book. They existed on GitHub for a while, but our implementation was awkward, and too big a project for the volunteer dev team. But it's likely that we'll evolve towards that.
marcprux 7 hours ago

> I have very mixed feelings about Standard Ebooks[…]
Why?
JSeiko 10 hours ago

I believe our new-ish CEO Eric Hellman actually did some work on something very similar
JSeiko 11 hours ago

That's an interesting idea. Not a small feat to accomplish, though...

jefurii 13 hours ago

When I thought about Project Gutenberg I remembered that original brutalist non-design. The current site has been very tastefully updated but looks like it's still very accessible if you turn styles off. Great job!

JSeiko 13 hours ago
sadly HN doesn't have a "heart" emoji I could use :D
- ricardonunez 7 hours ago
  
  I like the design but liked the previous design as well, it was unique and Craigslistish, you knew what website you were visiting just by looking at it.
- Wistar 13 hours ago
  
  ♡
- ok_dad 10 hours ago
  
  <3
  Less than three is a classic!

lucb1e 11 hours ago

Huh that's interesting: 4.5 seconds for the TCP handshake and an additional 9.2 seconds for the TLS handshake. Is this some kind of captcha, since most bots would disconnect before that, so if you complete it once then it knows you're good? (Until the bots catch on of course, but so long as it works it's relatively unintrusive and not discriminatory against uncommon client software (that is, non-Chrome/ium).) The rest of the requests were lightning fast

Edit: welcome to your first comment after 9 years on HN btw, nice to have you here!

codys 11 hours ago

I think their site is just slow, potentially because more people than they are used to are trying to view it.
I was unable to load it initially (got an error from firefox) and had to re-attempt. Still slow if one forces a reload (shift-r, etc, to not use local cache).
JSeiko 11 hours ago
We are having occasional dips in page speed due to LARGE amounts of bot traffic. Full disclosure: we haven't really been able to resolve this fully or well. Let us know if you have a good idea for how to deal with it.
- dimava 7 hours ago
  
  If it's purely bot traffic, then Anubis could help
  You may have seen it on some websites already
  https://anubis.techaro.lol/
  
- gropo 10 hours ago
  
  Do you host a torrent?
  I have about 50k of the books, I would have used a torrent of just the txt files if it was prominent.
- lucb1e 11 hours ago
  
  I'm only a small-scale sysadmin, but the way that I understand the internet is that you send abuse notifications to the IP address block owner and, if it doesn't get resolved, you block. The whois/rdap database reveals which IPs all belong to the same hosting provider or ISP, so you can condense everything into one list of IP addresses + timestamps per provider for a given time period
  The ISP actually knows which subscriber is on that line, can send them notices, block them, terminate them... loads of things that you simply cannot do because you have no relation to this person. And frankly I wouldn't want to need to have a personal relation with every website that I visit; my ISP can reach me if there is anything relevant to continued use of the internet. From personal experience, when I was a teenager, the ISP cutting our household off after an abuse report was an effective way of stopping what I was doing
  
- TurdF3rguson 11 hours ago
  
  CF cache?
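
A minimal sketch of the per-provider summary lucb1e describes above, assuming you already have (IP, timestamp) pairs from your access logs. The `NETWORK_OWNERS` table is a hypothetical stand-in for real whois/RDAP lookups; its address ranges and contact addresses are documentation placeholders, not real providers.

```python
from collections import defaultdict
from datetime import datetime
from ipaddress import ip_address, ip_network

# Hypothetical lookup table: network block -> abuse contact.
# In practice these entries would come from whois/RDAP lookups; the values
# here are placeholders (RFC 5737 documentation ranges, example domains).
NETWORK_OWNERS = {
    ip_network("192.0.2.0/24"): "abuse@isp-a.example",
    ip_network("198.51.100.0/24"): "abuse@host-b.example",
}


def owner_for(addr):
    """Return the abuse contact whose block contains addr, or None."""
    ip = ip_address(addr)
    for network, contact in NETWORK_OWNERS.items():
        if ip in network:
            return contact
    return None


def summarize(hits):
    """Group (ip, timestamp) pairs into one time-sorted list per contact."""
    report = defaultdict(list)
    for addr, when in hits:
        contact = owner_for(addr)
        if contact is not None:  # unknown blocks are skipped in this sketch
            report[contact].append((addr, when))
    return {
        contact: sorted(rows, key=lambda row: row[1])
        for contact, rows in report.items()
    }


if __name__ == "__main__":
    sample_hits = [
        ("192.0.2.7", datetime(2024, 1, 1, 12, 0)),
        ("198.51.100.3", datetime(2024, 1, 1, 12, 5)),
        ("192.0.2.9", datetime(2024, 1, 1, 11, 0)),
    ]
    for contact, rows in summarize(sample_hits).items():
        print(contact)
        for addr, when in rows:
            print(f"  {addr}  {when.isoformat()}")
```

Each resulting list could be pasted into one abuse notification per provider; whether that scales to the traffic described in this thread is a separate question.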

0x0203 9 hours ago

As long as you're taking suggestions, since many of the books are quite old, adding a publication date or date range to the search functionality might be nice. I personally would find it very useful since I have a tendency to look for things that are older than year _x_ when researching various things.

Thanks for all the effort put into the site!

Guestmodinfo 6 hours ago

Hi, I have known about Project Gutenberg for the past 20 years and used to read a lot from it. One obstacle I face is that there is no way to arrange the books in order of their original publication. Do you know of any such way? We can of course sort the books by their release date on Gutenberg, but that has long baffled me, as it feels like the least useful way of sorting them. Thank you for Project Gutenberg.

Falimonda 15 hours ago

The book list elements on the front page render as both horizontally and vertically scrollable divs on mobile - seems like an opportunity for improvement.

Keep up the good work!

JSeiko 15 hours ago
good feedback thanks! Doing an iteration on the homepage design is actually pretty high on the priority list. will keep your feedback in mind!
- Falimonda 9 hours ago
  
  Any interest in offering PG as a multi-lingual web e-reader in any language?
  I've since discontinued hosting it, but happy to add you all and merge into an official PG offering: https://www.reddit.com/r/SideProject/s/VtYKxjrMme
  

xrd 14 hours ago

Thank you for your work. This site is an international treasure.

8bitsrule 3 hours ago

Great project. Are many of the books in a format that can easily be converted into audio? Is there a way to search for them, and information on what software your readers find useful for this purpose?

(Note: A lot of print media these days has switched to far-too-small font sizes. Less of a problem for (zoomable) digital media, but for many that's still a barrier.)

excitednumber 14 hours ago

Thank you for being one of the best places on the internet

zamadatix 12 hours ago

Thanks for the free work! Project Gutenberg is nice to have :).

On the site I noticed the library boxes have roughly a single extra line, causing a scrollbar to appear and the last line to be chopped off (https://i.imgur.com/PQ8T0qc.png). Is there an issues/bug portal to properly submit these kinds of things?

JSeiko 11 hours ago

you can open an Issue at https://github.com/gutenbergtools/gutenbergsite

ExtremisAndy 14 hours ago

Oh, my! This does look nice. Thank you for your hard work!

JSeiko 14 hours ago

Thanks! We're currently working on a design update of the page of any specific book. Should be online soon (next 1-2 weeks or so)

smallnix 14 hours ago

There's a minor bug with Chrome on Android where the menu will not close when you tap outside the menu or on the menu link/button

JSeiko 14 hours ago

I've messaged the guy who's best suited to fixing this. He'll be on it this weekend
JSeiko 14 hours ago

will open an "Issue" for it

freedomben 12 hours ago

I can't say for Project Gutenberg specifically, but in general a huge issue I see is OCR errors. What do you all do to address OCR?

gluejar 12 hours ago
Check out Distributed Proofreaders: https://pgdp.net
- jfengel 10 hours ago
  
  I didn't realize DP was still around. I used to do it quite a bit, 15 years ago, but OCR has improved considerably since then.
lapetitejort 12 hours ago

I uploaded a PDF to archive.org that auto-OCRs with plenty of mistakes. I have found no way of updating the entire stack of documents produced. I wonder if Project Gutenberg is similar

shuvrojit 14 hours ago

Great Work. Thank you. I'm also a programmer. If you are ever short on help, let me know. I would love to contribute.

JSeiko 14 hours ago

https://github.com/gutenbergtools
autocat3 and gutenbergsite are repos responsible for generating gutenberg.org

TimorousBestie 13 hours ago

Wanna let you know you’re doing great work and you have my dream job, thanks to the team for everything!

JSeiko 13 hours ago
it's not my day job. PG is open-source. I'm "just" a contributor
- TimorousBestie 13 hours ago
  
  Oh, right. That makes sense.

BiraIgnacio 14 hours ago

Thanks so much for the work you and your team do!

samcollins 15 hours ago

Very cool! Do you have a recommended way for an agent to see an index of the books and epub links?

(I can’t quite tell if that’s an egregious abuse of the site or you’re perfectly fine to share without human eyeballs hitting your www?)

jzs 15 hours ago
Now, I'm not associated with Gutenberg in any form, but they do have a page for offline consumption:
https://www.gutenberg.org/ebooks/offline_catalogs.html
Perhaps you can find the information you are looking for there.
However, if you plan on scraping or otherwise hitting them with a ton of traffic, at least consider donating a good amount for the traffic you cause them. It ain't free after all.
- JSeiko 14 hours ago
  
  Donations are always appreciated ;)
kay_o 15 hours ago

Check out https://www.gutenberg.org/ebooks/offline_catalogs.html
Don't hit the site with an agent. The section at the very bottom is machine-readable.
samcollins 14 hours ago

Thanks for the answers! Found it:
> All Project Gutenberg metadata are available digitally in the XML/RDF format. This is updated daily (other than the legacy format mentioned below). Please use one of these files as input to a database or other tools you may be developing, instead of crawling or roboting the website.
And strongly consider a donation! (My addition)
https://www.gutenberg.org/ebooks/offline_catalogs.html#the-p...
gluejar 13 hours ago

if what you want is all the text, please use the tarball or data files at https://www.gutenberg.org/cache/epub/feeds
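
A rough sketch of reading catalog metadata from a downloaded archive instead of crawling, assuming the feed is a tar archive of `.rdf` files. The namespace URIs, the element names (`pgterms:ebook`, `dcterms:title`), the member path, and the sample record are all assumptions for illustration; check them against a real file from the feeds directory before relying on them.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the Project Gutenberg RDF records.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Invented placeholder record, shaped the way this sketch assumes.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:title>Placeholder Title</dcterms:title>
  </pgterms:ebook>
</rdf:RDF>"""


def parse_record(xml_bytes):
    """Extract the ebook identifier and title from one RDF document."""
    root = ET.fromstring(xml_bytes)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    about = ebook.get("{%s}about" % NS["rdf"], "")
    title = ebook.findtext("dcterms:title", default="", namespaces=NS)
    return {"id": about.rsplit("/", 1)[-1], "title": title}


def iter_records(fileobj):
    """Yield parsed records from every .rdf member of a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".rdf"):
                record = parse_record(archive.extractfile(member).read())
                if record is not None:
                    yield record


def build_sample_archive():
    """Build a one-file in-memory tar so the sketch runs offline."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="cache/epub/0/pg0.rdf")  # hypothetical path
        info.size = len(SAMPLE)
        archive.addfile(info, io.BytesIO(SAMPLE))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for record in iter_records(build_sample_archive()):
        print(record["id"], record["title"])
```

With a real download, `iter_records(open(path, "rb"))` would replace the in-memory archive; one download per day matches the update cadence quoted above.
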
dredmorbius 8 hours ago

Possibly ZIMs are of interest: <https://news.ycombinator.com/item?id=48152200>.
JSeiko 15 hours ago

Not yet, but that's not a bad idea imo. Dealing with AI crawler traffic is definitely a challenge if that's what you were referring to.
ancientcatz 15 hours ago
OPDS?
- gluejar 14 hours ago
  
  OPDS 2.0 coming RSN. Email us if you want to test. OPDS 0.x is currently available (not recommended) by adding .opds to the end of a URL