GitHub code search is generally available

2 years ago (github.blog)

Surprised more people aren't talking about how Microsoft has a near monopoly on the developer ecosystem. They've got GitHub, OpenAI, and VS Code all working together and collecting data that strengthen each other's products, while also using their embrace, extend, extinguish strategy with WSL. All of these steer people towards Azure services whenever possible. Seems like something that verges on an antitrust situation when you think about the flywheel effect data has for AI.

Credit to Microsoft for rehabbing their reputation with developers, but it seems like a massive Trojan horse.

  • > more people aren't talking about how Microsoft has a near monopoly on the developer ecosystem.

    But do they? In my day job, outside of the occasional use of Visual Studio and developing on a Windows machine, I use no Microsoft products for development.

    > credit to Microsoft for rehabbing their reputation with developers

    With a fair number of younger developers, but certainly not all. Most devs I know don't think of Microsoft any more kindly now than in the past.

    • Your particular individual experience is not representative of the overall reach that Microsoft has, which is what OP is pointing out.

      GitHub is the de facto "point" of software these days, with most devs jumping to that before anything else.

      VSCode is the highest ranked editor from the StackOverflow 2023 survey, with (IIRC) something akin to 70%.

      Azure is the icing on the cake, because now you have an entire generation of developers building on GitHub, from VSCode, and deploying onto Microsoft infrastructure.

    • > But do they? In my day job, outside of the occasional use of Visual Studio and developing on a Windows machine, I use no Microsoft products for development.

      According to your comment history, you are using JavaScript or used it in the past at least, which usually means you've used npm, which is owned by Microsoft.

      And since lots of "young" developers use JavaScript and TypeScript, most of them are also interacting with npm.

      Microsoft has captured maybe a larger part of the developer market than you realize. Not by being amazing or with "Microsoft <3 FOSS", but by buying up lots of the market.

    • I don't think they have one today, but it is looking like they could soon have one. Especially since I believe recent reports show that Azure is starting to outpace AWS in adoption? So you have .NET, Visual Studio (Code), Azure, GitHub, OpenAI, Windows, and I am pretty sure more that I am forgetting about. I think the big one that wasn't initially mentioned was Azure.

  • Don't forget TypeScript and npm as well which basically covers 99% of the JavaScript ecosystem if not more.

    Then Dependabot for large swaths of developers outside of the earlier mentioned ecosystems. LinkedIn for everyone's career.

  • "monopoly on developer ecosystem"

    GitHub? fair

    OpenAI? how is this a part of dev. ecosystem?

    VS Code? wtf? there's a lot of other IDEs/editors and many would argue that they are better

    >embrace, extend, extinguish strategy with WSL

    They are EEEing their product - Windows?

    • > > embrace, extend, extinguish strategy with WSL

      > They are EEEing their product - Windows?

      No, Linux obviously.

      First they like and integrate Linux into their own products. Azure, WSL and others.

      Then, they provide extensions that are closed-source on top of those.

      With the goal to extinguish the original project so they have more control over the direction.

  • Except Windows itself is not loved by most developers. Microsoft will take over the developer world when it fully replaces Windows with Linux (instead of WSL2, which is nice but not great).

    • Part of the reason Windows isn't loved by developers is also hardware. So a switch to Linux won't fix this, unless they made the switch when Apple was releasing those horrific keyboards!

  • No one is talking about it because developer tools are literally the core business Microsoft is in, and the one it was founded on 48 years ago.

    Yes, they have expanded in many ways - but developer tools/languages/IDEs/libraries is their bread & butter.

    The famous “Developers Developers Developers” by Steve Ballmer.

    https://youtu.be/Vhh_GeBPOhs

  • I think there was a lot of discussion on this when Microsoft took over GitHub, but as time went on people kind of accepted the reality.

  • i think it extends further than that, since they have: vscode, github, linkedin, npm, typescript, chatgpt. for many, this is almost the entire developer ecosystem.

    at a high level they pretend to embrace open source but many of the best features of vscode are closed source, such as remote editing and various language servers (pylance, etc.). the lsp saga is particularly unfriendly, since they pushed it as an open standard, tons of people contributed and adopted it, and then they closed the source to their most valuable language servers, making them only compatible with their product (vscode).

    there are countless similar examples. the way i see microsoft and the way they want to be perceived are entirely different.

  • Don't forget Teams, which seems to be killing every other Chat/Video platform in the enterprise sector.

  • GitHub -> GitLab, VS Code -> JetBrains. Why use WSL? Just go to Linux. No one forces you to use their products; they’re just a better developer experience imo, aside from Windows. Plenty of competition in the space. The question is: does better equal monopoly?

  • How is WSL an example of EEE? Other than adding a few things like the ability to navigate the Windows filesystem from WSL, they have really done nothing in the way of extending or extinguishing.

    If anything is classic EEE of what you listed it’s GitHub, given it’s basically a series of extensions on an open standard. Its Linux EEE program is Azure where Microsoft sells you a Linux server running in its datacenter in fancy ways that lock you in.

    I feel Microsoft’s monopoly around development is not that strong and it’s not terribly hard to completely avoid them.

  • Count me in the list of dev teams where that is not remotely true. I use exactly 0 microsoft projects at my software engineering job.

I have been using it since the beta; truly the most impressive product released in the last 5 years (along with ChatGPT). The amount of indexed code, the speed, and the precision of this search are simply stunning.

  • Absolutely. GitHub Code Search is by far the most valuable online development tool I have used the past year. It is so much more useful than Copilot or any of the AI LLMs in my experience.

    With Code Search, I have:

    * Rewritten a CMake build system, which would have been practically impossible without access to real-world examples because of how poorly designed and documented it is;

    * Validated machine-generated translations by looking up language strings from projects that used human translators;

    * Tracked down bugs in unfamiliar codebases, using symbol-based navigation, without the downtime of fetching a bunch of files and waiting for a language server to process them locally;

    * Reviewed how projects were using the APIs of a library I work on to determine whether high-maintenance features were actually used, and whether tricky features were being used correctly or needed redesigning to reduce programmer error

    Kudos to the team at GitHub. Genuinely stellar work.

  • Is the difference that now the basic search functionality actually works?

    The previous standard GitHub search I found to be remarkably bad. I would be looking at some small public repo, search for an exact string match I know exists in the code, scope the search to that repo only, and still see zero results. Even copy-pasting a line of code from a file in the repo often resulted in zero matches.

  • I'm pretty sure sourcegraph.com has been doing all that (search all GH repos, exact search in quotes, case-sensitive search, regex search, limit files using filename regex) for longer than 5 years.

  • Could you elaborate more on why this is a significant feature?

    I can see how it would be handy to search a codebase online, especially one you don't have cloned locally, but for my own codebases, I can search the entire thing just fine in VS Code with ctrl-shift-f.

    • It's not only about searching your own repository, it allows you to search through every single public repository on GitHub. I personally use it a lot to learn more obscure APIs which are badly documented or which I'm just not used to, simply search for the method I'm trying to use and find infinite examples of real world usage, along with the code license right next to it.

      It's also great if you drank the GitHub kool-aid, as you can do a single search and find related code snippets, issues, pull requests and discussions that could possibly help. I'm personally not too big into the ecosystem, in fact I'm considering moving to Fossil so I can have everything inside the repo, but for those who are it's a great feature.

    • Ever wanted to use an API and found the documentation to be lacking?

      GitHub code search pretty much solves that. For any API you can find an example of someone else using it.

      I've been using it for this for a year now and I wouldn't want to live without it.

    • I think others have pretty much summarised it already:

      Searching for undocumented or badly documented APIs, finding an implementation of an algorithm or a specific pattern, finding how people are using a niche tool, etc.

      I have even used it to find my own API in the wild to look for potential breaking changes and improvements to make.

    • For me it's been very useful at finding codebases with work done on extremely specific niche things that would've been near impossible to find otherwise (e.g. tools for obscure protocols hidden away on obtusely named repos)

I'm Colin from GitHub's code search team, happy to answer any questions.

For more info on how we built this, you can check out our technical blog post from a few months ago https://github.blog/2023-02-06-the-technology-behind-githubs...

  • It would be really awesome if code search could one day consume LSIF for precise results in its index, similar to Sourcegraph. The symbol search is good now, but approximate. Allowing devs to upload LSIF data in their CI pipelines would enable precise symbol search (go to definition / find usages actually being accurate) and remove irrelevant results.

    • Great point. Yes, we initially focused on zero-config approximate code navigation. But we do intend to support build-based code navigation in the future, since the approximate code navigation experience can be pretty poor for some languages (e.g. C/C++).

  • I've been using it for a few weeks, and it's a great improvement, congrats on the launch.

    (Pleeeease make the right click context menu just be the normal default menu though, I (apparently) "Right-click > Back" a lot when using a mouse rather than a trackpad and I keep getting that goofy Emoji dialog.)

  • Are you planning to release code search as open source? I can’t find a link to the source code anywhere.

  • I imagine a future in which this is integrated into vscode so I can go from an error message in the terminal to a search through my code + third-party modules that my code is importing

    • You can already do that by including your module folder (eg. node_modules, site-packages etc) in your project tree

  • Any plans to add support for Vue SFC syntax highlighting?

    Edit: correction it looks like that's been fixed since last week, nm

I generally like the new code search, but I've got one big gripe: there's no way to sort code results by any kind of proxy for recency.

The old code search had the ability to sort by indexed date. This wasn't perfect, but it was something.

I like keeping up with who's using my code and whether they're leaving comments or commit chains that outline trouble they're having with it. Sometimes old code pops up in the recently-indexed sort, but if I regularly search and look at the top page, I can see most new uses.

Without it, code search is basically useless for this purpose :/

  • (I work on code search.) Yeah, sorry about that. We've heard this feedback a lot. There are two reasons why we haven't implemented this. First, content is now shared between repositories, which makes this harder than before, when it wasn't shared. Second, we rebuild the index weekly or even more frequently, so the proxy of "when was this added" that was used doesn't work any more. What we would like to use is "when was this blob added to this branch" but that's extremely expensive to retrieve from Git because Git trees don't record it.

    • git blame is expensive, especially at scale, on big repos, but the thing to understand is that the exact hash something was committed at isn't important for time-based indexing. What's wanted is when, ± a few days, something was committed, which makes for a much cheaper query. (How merges are dealt with might also be material.)

      Barring that, though, you could use the equivalent of a post-commit git hook that updates the DB with 'when this blob was added to this branch', and then run a backfill job that covers enough history.

      The easy answer, though, would seem to be to keep a copy of last week's index, run the query twice, and figure out a way to efficiently compare the results to see whether something is in this week's index but not last week's.

      Also of note, "when was this blob added to this branch" isn't even actually the same as git blame, which means that if a file was touched that matched the search but the latest change to the file doesn't affect the matching line of code, it'd show up as recent, which is not what the user wants.
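
The snapshot-comparison idea in the reply above can be sketched roughly as follows. This is only an illustration under loud assumptions: the `Snapshot` shape, the `search` and `new_hits` helpers, and the sample repositories are all hypothetical, and nothing here reflects how GitHub's index actually works. It compares matching lines rather than whole blobs, so a file that was touched elsewhere does not show up as a new hit.

```python
import re
from typing import Dict, Set, Tuple

# Hypothetical snapshot shape: (repo, path) -> full file content.
Snapshot = Dict[Tuple[str, str], str]
Hit = Tuple[str, str, str]  # (repo, path, matching line)


def search(snapshot: Snapshot, pattern: str) -> Set[Hit]:
    """Return one hit per line that matches the regex pattern."""
    regex = re.compile(pattern)
    hits: Set[Hit] = set()
    for (repo, path), content in snapshot.items():
        for line in content.splitlines():
            if regex.search(line):
                hits.add((repo, path, line.strip()))
    return hits


def new_hits(last_week: Snapshot, this_week: Snapshot, pattern: str) -> Set[Hit]:
    """Run the same query on both snapshots and keep hits only in the newer one."""
    return search(this_week, pattern) - search(last_week, pattern)


# Made-up sample data for illustration only.
last_week: Snapshot = {
    ("alice/app", "main.py"): "import mylib\nmylib.run()\n",
}
this_week: Snapshot = {
    # Same file, edited on a line that does not match the query.
    ("alice/app", "main.py"): "import mylib\nmylib.run()\nprint('done')\n",
    # A repository that started using the library this week.
    ("bob/tool", "cli.py"): "from mylib import run\nrun()\n",
}

recent = new_hits(last_week, this_week, r"\bmylib\b")
```

With the sample data, only the `bob/tool` usage is reported as recent, even though `alice/app` also changed. Running the query twice doubles the search cost, which is the trade-off the reply accepts in exchange for not walking Git history.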

The feature isn't working well yet on C and C++. If I recall correctly it's based on Tree-Sitter[1] parsing, and there are still too many bugs in corresponding grammars - tree-sitter-c[2] and tree-sitter-cpp[3]. Hopefully, it will be greatly improved in the future as the share of the existing and newly written code in C and C++ is quite significant.

[1] https://tree-sitter.github.io/tree-sitter/

[2] https://github.com/tree-sitter/tree-sitter-c/issues

[3] https://github.com/tree-sitter/tree-sitter-cpp/issues

  • As an Emacs user, the tree-sitter-based Elixir and HEEx modes are vastly superior to the built-in HTML+ and Elixir modes.

I'm happy that at last, my Stack Overflow question and answer have been fully solved with technology improvements!

https://stackoverflow.com/questions/43891605/search-partial-...

It's been almost 6 years, though... for a search scenario that would be trivial to implement with grep (at scale that's another thing...) Still, a nice example of perfect being the enemy of good, I guess.

The time to pull request with a good code search, paired with GitHub.dev and Copilot is really a force multiplier. What a time to be alive.

I don't get the praise over GitHub code search. I find it very inaccurate and often missing references etc. Maybe it depends on the language you are using it with? (Go here)

  • Are you referring to what they just released? GitHub code search has always been notoriously bad. This is a new search product:

    "today, our new code search and code view are generally available to all users on GitHub.com"

  • Have you tried the new search that was just launched? It seems to have significantly improved search accuracy. I agree the old version wasn't that great (though still was one of the better options for finding usage of things in the wild like rarely used OSS dependencies I needed to debug).

    • I was in the beta, which I'm assuming was the same. I find it sometimes misses references in the same folder as the file I'm looking at when I do a reference search.

I have missed Google Code Search, which launched in 2006 and was discontinued in 2013. Similar to GitHub's code search, it supported searching by regex and filtering by language etc. - but obviously the amount of code to search through is orders of magnitude larger than it was 10-15 years ago. Still I wonder what took GitHub so long to build this - it's hardly a novel idea, and it seems like such an obvious power tool for programmers to have.

It's good.

It's mind-bogglingly, insanely good compared to the garbage that was there before for years and (I think) still is in Azure DevOps.

Don't get me wrong but how is this any better than say Scintilla searching locally? I feel like the previous one was very bad and this is baseline but please tell me where I am wrong. From a brief test this looks like something you'd have built with Sphinxsearch a decade plus ago?

No way to sort by recency makes the new code search completely unusable. Having to hover the file title to see the updated date is horrible. For some searches, I want to see how developers are solving problems nowadays, not 10 years ago. No more number of matching files per language. Not possible to filter by multiple languages anymore. Such a downgrade.

This is a huge upgrade...but one major annoyance with their new interface is some keyboard shortcuts conflict with browser shortcuts. I'm looking at my repo and I ctrl+f to search the current file and start typing my search query assuming it's working...but it's actually focused on the file browser and started trying to edit the file instead?!?

>Without code search, you might have to clone a bunch of repositories and grep through them

Why is this an issue? I do have a clone of all my company repos, doesn't everyone? Memory is cheap and ripgrep/ag are fast.

  • No, every microservice or new project/tool gets its own repository. There are new ones being created quite frequently, so I have ~10 repositories cloned that most relate to the work I do, but nothing beyond that.

  • Bandwidth is not cheap for many of us.

    • how big are the repos at your company?

      Bitbucket has a 2GB max size on each repo, I'd suspect other providers have similar restrictions. Would be a problem on mobile but repos that hit that limit are rare.

Seems like grep-aaS? Or am I missing something. It could not e.g. get the definition of a C++ function for me. Really useful still, but should have been there years ago.

My *first* test is not very successful.

> Whoa there!

> You have exceeded a secondary rate limit.

> Please wait a few minutes before you try again; in some cases this may take up to an hour.

I initially thought “generally available” refers to removing the login wall. I remember when you didn’t have to sign in to search code on GitHub; I miss that.

  • Probably due to bot abuse. Can't have nice things on the internet.

    Requiring an account makes rate limiting vs botnets easier.

Hold on, I paid for Copilot; what does GitHub code search buy me? Not to mention I could already kind of search code in the past.