Show HN: Answer Overflow – Indexing Discord content into the web

3 years ago (answeroverflow.com)

Hi!

I'm Rhys, and I develop Answer Overflow, a search engine for Discord channels. Answer Overflow indexes content from channels into Google, making it discoverable on the web.

I'm sharing this again after seeing a lot of discussion during the Reddit blackout about the inaccessibility of information sent in Discord servers.

Answer Overflow is a verified bot in over 100 communities, fully complies with the Discord ToS, and is open source! https://github.com/AnswerOverflow/AnswerOverflow

Check out some of the communities here!

T3 Community - https://www.answeroverflow.com/c/966627436387266600

C# - https://www.answeroverflow.com/c/143867839282020352

Reactiflux - https://www.answeroverflow.com/c/143867839282020352

All - https://www.answeroverflow.com/browse

Please let me know what feedback you have, thanks for checking it out!

Genuine question: I love Discord, but how on earth is it possible that such functionality was not built-in to begin with?

I really don't understand how the need for indexing and search was overlooked.

  • I think it's due to how Discord evolved as a platform

    Discord started as "your private place for your friends to talk" during a time where there were a lot of privacy issues with other communication methods.

    Then, as it grew beyond being a private place for friends, it would have been good for indexing to be added. But indexing a normal text channel is really hard, since you don't know where a conversation starts or stops when deciding what to submit to a sitemap.

    Now we've got large public communities and forum channels so it's possible they roll out their own version soon, but it does still slightly go against how their product was originally created so there may be some hesitation with adding it due to not knowing what the community reaction will be like.

    • >Discord started as "your private place for your friends to talk" during a time where there were a lot of privacy issues with other communication methods.

      Discord started as a way for gamers to chat with one another. Initially the developers even wanted to sell games directly from the platform [1].

      I think it would be incorrect to position Discord as a privacy-oriented platform when the desktop client needs to be run in a sandbox because there's no real way to disable data collection.

      1. https://www.pcgamer.com/the-discord-game-store-is-now-open/

    • It's beyond ironic that Discord called itself that, being the successor to OpenFeint and its privacy lawsuit scandal, and being proprietary.

      Now it is one of the most privacy-hostile AND preservation-hostile platforms around.

  • It makes no sense to index the vast majority of content. You would need to cherry pick really hard among all the noise to find the stuff worth putting online.

    • I would argue it makes no sense to index the vast majority of content without good search. If your search is good enough, you can index everything and then surface only the good stuff at query time.

    • Interesting comment. I would think Reddit is similar in terms of content, yet “site:reddit.com <query>” is common as a general search pattern (pre-blackout)

  • Discord has 'indexing' and search, just like how Slack does. It's just not on the public & open web - only searchable inside of Discord.

  • > I really don't understand how the need for indexing and search was overlooked.

    It wasn't overlooked. The point is to make it difficult for outside users to access information unless they sign up.

  • What I wonder is why anyone who cares about archiving/search would choose to use Discord.

    • Apparently because it is very easy to set up and offers a place where people can join.

      More and more open source projects are using it and I don't really like it, but what easy alternatives can you recommend to them?

      Genuine question, as it is an open issue for me. I want to focus on my project, not on setting up and maintaining a forum, mailing lists, etc. on top of that.

  • Discord is a chatroom first. What non-enterprise chat comes with archives?

    A forum is totally different.

    And even then, forums weren’t designed to be archived from the start. People just wrote web crawlers and search engines.

    (I know Discord has some forum-like functionality now but the point stands.)

  • Discord does have search, but I really hope they do not improve it.

    The lack of good search really prevents the hostility towards new users that you often see on Reddit/forums where every question is instantly answered by a one liner "use the search" reply.

    Discord communities are some of the most friendly and welcoming communities I have ever encountered on the internet. I think a large part of it is the chat nature and inability to easily pull up old comments.

Rhys - are you sure the consent functionality is working? I'm seeing indexed posts by users who are in a time zone that makes it very unlikely they have consented in the last hour or so.

The one user whom I contacted said they had never clicked the green consent button.

EDIT - turns out those posts were only visible to me when I was logged in to both sites (which makes sense).

It wasn't obvious this was the case and checking incognito shows things correctly.

  • Glad we got this resolved and that it was all working properly. The site does need to do more to make it clear when you're viewing a private message while signed in. I've added it to the backlog, sorry about that!

While I see the value here, I don't really think most Discord communities are appropriate to be indexed. It breaks the whole cozy web aspect of it. [1]

[1] https://maggieappleton.com/cozy-web

  • The "cozy web" is out of control these days. A lot of social utility is lost by default because everyone uses Whatsapp and Discord and other such information black holes, places where knowledge goes to die. It's OK if you're using these to chat with your family or friends, but it's kind of... less OK, when every open source project these days, including major programming languages, tells you to join their Slack or Discord for support and learning.

    What's happening is that these "communities" demand that you commit first, and deny providing value to passive participants. If that sounds reasonable to some, let me point out that the entire value of the Internet is built on doing the opposite. Wikipedia, Reddit, StackOverflow, everything that you can find through a search engine - those are all resources made available by people and groups that, for various reasons, decided to share knowledge instead of hoarding it, invite passive participation instead of demanding active commitment. The good days of the Internet, the ones people mourn, back before it got fully commercialized? They were built on the sentiment of openly sharing information, giving it "pay it forward" style - not gate-keeping it in webs of trust, and/or demanding that people pay with effort.

    Maybe I'm too old, but I hate the "cozy web" with passion.

    • I was an active participant of the 90s web, and in fact a lead editor and forum moderator for a popular turn of the millennium news site, so I understand the frustration you're sharing.

      That said, I'd argue it's not the "cozy web" that's out of control, but instead the "dark forest" that has forced the creation of the cozy web. The cozy web is the only bastion of the internet left where there's still some semblance of the pay it forward community aspect of the early web.

      Yes, it is at the cost of not being indexed, but it's the only way of having the genuine sorts of conversations and creation with people of shared interests that typified the early web now.

    • > Maybe I'm too old, but I hate the "cozy web" with passion.

      I don't know if you remember the net/web split but that's exactly what it felt like. Net people would crap on port 80, demand you install a news client and add some byzantine undocumented header or join an IRC channel and send custom DCC commands. There was also a lot of gatekeeping and making fun of the normies ("I may be a nerd but look at Bill Gates, one day I'll be your boss.")

      It was a culture I really didn't enjoy and I mostly stayed out of because everyone seemed so interested in exclusivity. Not too many people seem to remember those communities either which says a lot.

    • I am an advocate for knowledge sharing and have previously contributed (a tiny amount) to the community mentioned above, Reactiflux. There, I was able to share my knowledge freely without fear of being penalized or judged through a voting system, or being heavily moderated as is the case with Wikipedia or StackOverflow. I also didn't have to worry about my contributions being eternally indexed on the internet. As a contributor, this is a feature (much less so for the lurker).

      On that note, I recently had to request a deletion from Internet Archive because I shared content on my personal website that violates a ToS (it's a Slack archive that I have already anonymized). Unsurprisingly, my request went unanswered.

    • I agree, but the old web did have some information black holes, like IRC, unarchived mailing lists, junky forums, and more I probably can't remember

  • Most Discord communities aren't meant to be indexed, I agree! Thanks for linking that article, it was interesting to read.

    There are lots that have support channels, though, for programming libraries, games, etc., and having all of that content locked away can be really damaging.

    One of the interesting things I've noticed is when a community for a more niche game / programming library joins Answer Overflow, they often shoot up to being top performers on the site which is great to see.

    Along with that, not all channels are indexed, mainly just help channels. What's nice with this is it keeps that cozy feeling of a private place to talk, while helping more people find a community they will enjoy and keeping information accessible.

    Long term, I'd like to implement forms of anti-abuse tools for communities to use so they can understand what the types of people who join their server from Answer Overflow are like. For example, if it turns out that 90% of the people who join are abusive, then it'd make sense for them to turn off indexing.

    You could possibly make the argument that for the long term health of some communities, having indexed content helps to keep the community active

    • Thanks for the thoughtful response. Glad to see this is something you care about preserving.

      Good to see you're careful to only share particular channels.

      I have more thoughts on marketing this and also on guidelines for server administrators implementing search indexing. For marketing, most importantly, it could be good to make it clear you're focused on selective sharing only of channels which it would be a public good to make indexable. For administrator guidelines, most importantly, I think there should be several measures to ensure that users are aware of and agree to having their communications in particular channels publicly indexed.

      I ran this by GPT-4 for some more context and detail. [1]

      I think with measures like this we may be able to realize the good of indexing without going so far as to drive away the safety of the walled garden aspect of Discord.

      As an aside, for users of existing Discords, I encourage you to learn to use the search features built into Discord. Discord itself indexes servers and the search has good filtering functionality. I suspect if you already know which Discord server has the information you're looking for, you'll have a better experience with the internal search than trying to lean on Google.

      If you want to do better than the internal search, perhaps creating a vector store of the channel and setting up an AI chat application in front of it would be a solution.

      [1] https://chat.openai.com/share/254632c2-c25b-4299-88c9-2ce49e...
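The vector-store idea mentioned above can be illustrated in miniature. This is only a rough sketch, not a recommendation of any particular stack: it swaps a real embedding model for plain word-count vectors and uses cosine similarity for retrieval, and all of the names here (`embed`, `cosine`, `search`, the sample messages) are made up for illustration. A real setup would replace `embed` with a learned embedding model and put a chat layer on top of `search`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a word-count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(store, query, k=1):
    """Return the k stored texts most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical channel messages, indexed once up front.
messages = [
    "To reset your password open the settings page",
    "The build fails on Windows because of path separators",
    "Anyone up for a game tonight?",
]
store = [(text, embed(text)) for text in messages]

best = search(store, "why does the build fail on windows")
print(best[0])
```

Word counts only match on exact tokens, which is exactly the limitation a learned embedding is meant to remove.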

  • Most Discord communities that are big enough to get indexed were supposed to be forums anyway, or part of one.

Indexing Discord is going to be tough. The reason is that context is all over the place:

Question in one message. Then two unrelated messages. Then a partial answer by somebody. And so on.

It’s even worse than indexing a PDF. Just breaking stuff into paragraphs and generating embeddings isn’t going to cut it.

  • I imagine this will only work (and only index) threads. So the context can be gathered from the thread title/body and underlying messages reflect the discussion.

    Some communities I'm in have #support channels which only support threads. So you create a thread, add a title and a body message and people can reply to your thread by clicking on it. There's no way to post individual messages; only comments in threads.

    Thread overview: https://i.imgur.com/jfvrRtG.png

    Opening a thread: https://i.imgur.com/pqGrARI.png

    This solves your context problem. Still not sure if this is the right direction we want to go in. This just proves to me that Discord is not the right tool for the problem at hand.
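To make the thread-based approach above concrete, here is a minimal sketch of turning thread messages into one indexable document per thread. It assumes each message already carries a thread ID, which is the property that forum-style channels provide and ordinary text channels lack. The `Message` class, the `titles` mapping, and `build_thread_documents` are hypothetical names for illustration, not part of Discord's or Answer Overflow's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:
    thread_id: str  # hypothetical: ID of the thread this message belongs to
    author: str
    text: str


def build_thread_documents(titles, messages):
    """Group messages by thread so each thread becomes one indexable document."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg.thread_id].append(msg)

    documents = {}
    for thread_id, msgs in grouped.items():
        title = titles.get(thread_id, "(untitled)")
        body = "\n".join(f"{m.author}: {m.text}" for m in msgs)
        # Title first, then the discussion in posting order.
        documents[thread_id] = f"{title}\n\n{body}"
    return documents


# Messages from two threads arrive interleaved, as they would in a busy server.
titles = {"t1": "How do I deploy?", "t2": "Build fails on Windows"}
messages = [
    Message("t1", "alice", "My deploy hangs at step 3."),
    Message("t2", "bob", "I get a path error."),
    Message("t1", "carol", "Try clearing the cache first."),
]
docs = build_thread_documents(titles, messages)
print(docs["t1"])
```

Without a thread ID there is nothing to group on, which is the context problem described earlier in this subthread.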

Welcome back. How does this compare to Linen (https://github.com/linen-dev/linen.dev#readme), which claims to support Slack and Discord? I do see the license difference, but didn't know if that was the major differentiator

  • Couple key differences:

    - Answer Overflow works on a consent basis for displaying messages (https://docs.answeroverflow.com/user-settings/displaying-mes...), while Linen does all the messages in a community. The consent system Answer Overflow has helps a lot with respecting user privacy while also getting content indexed.

    - Linen appears to be building out a competitor to Slack & Discord while Answer Overflow is focused on building on top of those platforms, so we've got very different roadmaps. From what I can gather from the Linen roadmap, they're implementing things like voice chat, private channels, etc., whereas with Answer Overflow some of the things I'm focused on are answer automation, tracking outdated answers, analytics for where to improve your docs, etc.

    - Answer Overflow is pretty much only focused on Discord servers. It wouldn't be too hard to support both Slack and Discord, but what's nice about focusing on Discord for now is that it helps with our goal of being the best indexing tool specifically for Discord.

    - Global search (https://www.answeroverflow.com/search): you can search all Answer Overflow communities at the same time.

    The team at Linen have built out a great product though and it's cool watching them succeed with it!
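As a rough illustration of the consent-based display described in the first point above: the idea is that a message only appears publicly if its author has opted in. The sketch below is an assumption about the general shape of such a filter, not Answer Overflow's actual implementation, and `Message`, `publicly_visible`, and `consenting_authors` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Message:
    author_id: str
    text: str


def publicly_visible(messages, consenting_authors):
    """Keep only messages whose author has opted in to public display."""
    return [m for m in messages if m.author_id in consenting_authors]


# Hypothetical data: only alice has given consent.
consenting_authors = {"alice"}
messages = [
    Message("alice", "Restart the bot."),
    Message("bob", "Thanks!"),
]
visible = publicly_visible(messages, consenting_authors)
print([m.text for m in visible])
```

The default here is exclusion: an author who has never made a choice is treated the same as one who declined.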

People who give their consent to Discord to host their writing don't necessarily do so for third parties. Isn't there a copyright issue here?

This is awesome and timely!

I’ve been wanting to set something like this up for the nullbits server for a while. When I picked discord instead of a forum, I wasn’t counting on the growth we saw. There’s a lot of friction for new folks who aren’t yet on discord, and there’s a lot of knowledge in the server that’s locked behind discord.

Just set everything up! My only feedback is that enabling indexing for all of our text channels took a while doing them all individually, but that’s kind of on me for not enabling forums for help requests until now.

  • Welcome to the Answer Overflow community! I agree it'd be good to have a quicker way to set up multiple channels. To be honest, it's kind of far down the backlog as it's pretty rare that a server has many, but the UX could be improved there.

    If you have any other feedback, please send it to me on Discord so I make sure I see it - thanks!

There are several issues with surfacing search results from Discord as mentioned before in the thread, and even if all of them are resolved the biggest one remains relevance.

Unless a general purpose web search engine introduces a special Discord 'tab', like the Images/News/Videos tabs that already exist, there is no way for a search engine to assign relevance to anything said on Discord because there is no authority or link graph based credibility for any message. In other words a mention of 'blue widgets' on Discord is competing with millions of web pages mentioning 'blue widgets' which all have some kind of built in relevance. If the idea is that this will be achieved through people linking to an aggregator like this website, then perhaps, but the approach does suffer from the chicken-and-egg problem.

  • I'm mostly interested in surfacing content on pretty specific topics with clear keywords.

    But also either answeroverflow.com will gain some domain authority over time, or the communities will be hosted on domains that already have some.

I can imagine obvious use cases for data surveillance, OSINT and so on. But happy to see an implementation of a semantic search engine powered by an LLM.

I was talking about needing a solution like this just a second ago. Down from the heavens, descends this. I'll be sure to give it a try!

  • Me too! I am trying to build a Discord-based remote course and am excited to read through the code here and see if it matches my needs, or can be tweaked to do so.

    Once I do that I'd like to DM you with some questions mid-kid.

    Nice job on getting so much implemented and open for users!

  • Send me a message if you have any questions! Happy to help with getting it set up.

    • This might sound a little bit picky, but from a cursory look around the project, it feels a bit too corporate and platform-ey for my tastes. I'm only interested in two things: generating (ideally static, and SEO-friendly) web pages out of a Discord forum channel, and self-hosting it so we can archive the data ourselves (and won't be bound to the content policies of answeroverflow.com). All of the extra bells and whistles (the bot auto-managing channels, analytics, AI and whatever else) are superfluous and make me sweat a little, as I'll have to comb through the documentation to make sure everything is set up correctly. It's also really a shame to read that self-hosting will be a "Pro" feature. I'll give props for considering users wanting to opt out, however, and it does at least seem rather simple to set up.

Cool idea. There have been cases where I had to create a burner account just to access a Discord community and its walled content.

If this takes off you may very well get a letter from Stack Overflow lawyers over the name. It's your choice if you want to take that risk, but just FYI.

(And to be honest, I think they would be justified too; I initially assumed it was related to Stack Overflow based on the title, but it turns out it's not – this is the sort of confusion trademarks are intended to protect against.)

  • Under their own guidelines it's fine https://stackoverflow.com/legal/trademark-guidance

    > Do name your application with something unique. Including one of the terms, "Stack" or "Exchange" or "Overflow" in your product name is generally okay.

    It's a different enough product that I feel comfortable with it - Stack Overflow is only for programming while Answer Overflow is for all topics. Along with that Overflow is a pretty generic word and if you wanted to get super technical with it, the context I'm using the word in is "I have so many answers they're overflowing" while theirs is a reference to a programming term.

    We'll see, and I'm not a lawyer, but given that their trademark guidelines allow it, I feel comfortable.

    • That part specifically refers to things built on the Stack Overflow API. And "generally okay" is of course hardly a guarantee. "Overflow" as a word is fine, obviously, but it does sit in the same "get answers" space – I can name my restaurant "Best Apple", but I'll have more problems if I named some piece of electronics "Best Apple".

      It's your site, you can do what you want with it and you're free to ignore my comment – that's fine! But personally, I wouldn't have named it Answer Overflow.

It would be useful if clicking an image opened it in an imagebox or expanded it inline.

Um, why Google? So your indexes can be polluted with their shitty advertising? Why not expose your index as a service? I mean really, WTF not?

I'm sure Discord and their communities are absolutely ecstatic about opening up the doors to openAI and others to scrape their collective work for the latest LLM.

Walled gardens are going to get a whole lot stricter.