SQLite's source code contains the following comment.
2001 September 15
The author disclaims copyright to this source code. In place of a legal notice, here is a blessing:
- May you do good and not evil.
- May you find forgiveness for yourself and forgive others.
- May you share freely, never taking more than you give.
SQLite is _the_ most amazing database in my opinion. (Not the most amazing distributed database, for obvious reasons.) It strips away all the connection and network I/O management and focuses on the actual database. It's _the_ most deployed and used database in the world (it's even running beyond Earth). Because of its wide distribution, it has to be very well tested: it's much harder to fix a bug on the client than on the server.
Dr. Hipp is amazing. I wish more people were like him.
It's a great read, but it has been posted here quite often. Some of the previous posts have longer comment threads.
https://hn.algolia.com/?q=how+sqlite+is+tested
Past related threads. I feel like there must have been something else since 2016?
How SQLite Is Tested - https://news.ycombinator.com/item?id=633151 - May 2009 (28 comments)
> I feel like there must have been something else since 2016?
Funny, that was my feeling as well.
I'm 99% sure I've read a discussion of this here, but all submissions after 2016 (apart from today's) have no comments at all, hence no comments linking to the older discussion either.
Edit: Removed the remark on missing comments. The default search settings do not search for comments.
Exactly. What's the mechanism by which this keeps getting posted individually (a time limit, I assume)? It almost seems like marketing spam at this point.
I would say this sums it up:
> the SQLite library consists of approximately 143.4 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 640 times as much test code and test scripts - 91911.0 KSLOC.
How does SQLite compare for bugs in comparison to other DBs? How much are the other players paying out in bug bounties?
I remember reading someone here explain the extensive number of tests the Oracle database has, and how implementing a feature is more a game of waiting for tests to fail than of writing code. I couldn't find it now; maybe I'm misremembering.
Especially after reading the MySQL post yesterday[0], I don't know what to expect anymore.
Does anyone have anything to say about the testing story of Postgres?
[0]: https://news.ycombinator.com/item?id=18442941
He talks about it in the CoRecursive podcast[1]. No mention of code bounties anywhere.
https://corecursive.com/066-sqlite-with-richard-hipp/
And how many of those lines of tests are simply autogenerated?
I hope most of them, 91 million lines of code is insane
90 million lines of test code?
That seems ridiculous to me. I can’t imagine what would lead to so many lines, but I strongly doubt it’s all actually source code.
A lot of the mentioned tools are fuzzers[1]: tools that start from certain input(s), say "SELECT * FROM TABLE WHERE some_column = some_value", and repeatedly
1- mutate[2] the input,
2- measure some metric of the database's performance (e.g. how many non-syntax errors it spat out),
3- and introduce more or fewer mutations to increase or decrease that metric.
In short, it's an evolutionary computation where the population being generated is test code, the fitness function is some metric summarizing how well the generated test code exercised the DB code, and mutation operators gradually push the generated test code towards whatever optimum we want our tests to reach.
This is an extremely general approach that can be used to generate anything, and it's uncannily effective at finding bugs, especially in consumers of structured formats like compilers[3] and DB engines. You can use it to generate queries as above, where the fitness function would be something like "Did the database give an unholy error squeak or violate some key invariant in the DB file?", or you can use it to generate general test harnesses that exercise the whole C code and measure coverage via instrumentation, where the fitness function would be something like "How much of the application's instructions did this mutation cover?".

Not only does the SQLite project take both of those approaches and many others, it does so multiple times and independently: for example, it's mentioned that Google and the core development team each have an independent query fuzzer. They mention about 4 or 5 fuzzers doing similar things.
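To make that concrete, here is a minimal, purely illustrative sketch of such a mutation loop against SQLite, using Python's built-in sqlite3 module. The seed query, token list, and outcome classification are my own simplifications for the sketch, not anything the SQLite project actually runs:

```python
import random
import sqlite3

SEED = "SELECT * FROM t WHERE c = 1"

# Token-level (crudely structure-aware) mutations: swapping whole tokens
# keeps far more mutants syntactically valid than flipping bytes would.
TOKENS = ["*", "c", "1", "NULL", "-1", "x'00'", "(SELECT c FROM t)"]

def mutate(query: str) -> str:
    parts = query.split()
    parts[random.randrange(len(parts))] = random.choice(TOKENS)
    return " ".join(parts)

def classify(query: str) -> str:
    """Run one mutant against a fresh in-memory DB and classify the outcome."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (c)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.execute(query).fetchall()
        return "ok"
    except sqlite3.OperationalError:
        return "rejected"                 # uninteresting: invalid SQL
    except Exception as exc:
        return f"interesting: {exc!r}"    # anything else deserves a look
    finally:
        conn.close()

corpus = [SEED]
for _ in range(1000):
    mutant = mutate(random.choice(corpus))
    outcome = classify(mutant)
    if outcome == "ok":
        corpus.append(mutant)             # valid mutants breed further mutants
    elif outcome.startswith("interesting"):
        print(mutant, "->", outcome)
```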
And of course, fuzzing is just one way to generate tests; there are plenty of others.
>I strongly doubt it’s all actually source code
I mean, why not? Source code is not necessarily hand-written source code.
[1]: skim https://www.fuzzingbook.org/ for a fairly good overview
[2]: Generally, mutation is either completely blind and general (i.e. byte-level) or structure-aware (i.e. it has some notion of a grammar or schema governing the thing it mutates, so it wouldn't just change a random letter of "SELECT", because the result would almost certainly be invalid SQL).
[3]: https://embed.cs.utah.edu/csmith/
Databases are extremely complex.
>> the project has 640 times as much test code and test scripts - 91911.0 KSLOC.
This is amazing. One of the reasons some developers don't write extensive test code is that it can delay deliverables and cause missed deadlines. How is the SQLite team able to write such a large number of test cases and not miss deadlines?
One reason, I suspect, is that it's a well-defined product. The "spec" is quite settled at this point. When changes are made, existing tests just need to catch regressions, not be changed extensively because the meaning of existing primitives changed. Projects with significant spec/feature churn can't do the same.
> not miss deadlines
SQLite doesn’t really have market pressure, so they can choose any deadline such that it provides adequate time for writing test code — assuming they set deadlines at all.
One thing I was kind of surprised about while reading through this massively humbling post: if there's a bit flip in a value in a closed DB, the value just changes?
Like, there’s no checksum or error correction in an SQLite DB?
That's fairly normal for database systems. E.g. in Postgres it's also an optional feature that needs to be enabled, and AFAIK MySQL doesn't have it either (it has an explicit "calculate a checksum for this data" command, which can e.g. be used to verify replication or backups, but no live checksumming).
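If you need bit-flip detection today, one workaround is to layer it on at the application level. A minimal sketch, where the table name and hashing scheme are illustrative:

```python
import hashlib
import sqlite3

# SQLite won't notice a flipped bit in stored data by itself, so store
# a digest next to each value and verify it on every read.
conn = sqlite3.connect("data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, value BLOB, sha256 TEXT)"
)

def put(key: str, value: bytes) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
        (key, value, hashlib.sha256(value).hexdigest()),
    )

def get(key: str) -> bytes:
    row = conn.execute(
        "SELECT value, sha256 FROM blobs WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    value, digest = row
    if hashlib.sha256(value).hexdigest() != digest:
        raise ValueError(f"checksum mismatch for {key!r}: possible corruption")
    return value

put("greeting", b"hello world")
assert get("greeting") == b"hello world"
conn.commit()
```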
I use SQLite every day at work as a web dev. I do a lot of work with file uploaders, and being able to import CSV files as tables I can query is a killer feature.
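For anyone who hasn't tried that workflow: the sqlite3 command-line shell does it with .mode csv and .import, and you can get the same effect from Python's standard library alone. A rough sketch, where the file name and the untyped schema are illustrative:

```python
import csv
import sqlite3

conn = sqlite3.connect(":memory:")

# Build a table whose columns mirror the CSV header, then bulk-insert.
with open("uploads.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE uploads ({cols})")
    conn.executemany(f"INSERT INTO uploads VALUES ({marks})", reader)

# From here on it's plain SQL against the uploaded data.
for row in conn.execute("SELECT COUNT(*) FROM uploads"):
    print(row)
```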
At what point does it just become easier to prove the semantics with e.g. VeriFast or Frama-C?
Tangential: I've seen a couple SQLite articles today. Did I miss some cool related news?
I got the impression that this may have been posted because other fellow HN'ers mentioned it in the "How GNU coreutils are tested" thread.
SQLite is the most underrated database of all time.
I know it's popular, but nowhere near to the level it should be.
"SQLite is likely used more than all other database engines combined"
So, like, even more popular than that?!
https://www.sqlite.org/mostdeployed.html
> nowhere near to the level it should be.
I get the impression that it's used exactly as it should be, in systems that need a database but don't need clustering or failover strategies.
> but don't need clustering or failover strategies.
Even for those cases where you do need it, there are emerging options. You can handle replication 100% in business logic (something I personally enjoy), or you can use a project like dqlite to replicate the physical WAL.
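As a toy illustration of the "replication in business logic" idea (not how dqlite works, and with none of the ordering, failure handling, or replica catch-up a real system needs):

```python
import sqlite3

# Apply every write to a primary and a replica in lockstep.
class ReplicatedDB:
    def __init__(self, paths):
        self.conns = [sqlite3.connect(p) for p in paths]

    def execute(self, sql, params=()):
        for conn in self.conns:   # writes fan out to every copy
            conn.execute(sql, params)

    def commit(self):
        for conn in self.conns:
            conn.commit()

db = ReplicatedDB(["primary.db", "replica.db"])
db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("color", "blue"))
db.commit()
```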
As far as I can tell, there are far too many file formats out there. I'm sure SQLite should be used more than it currently is.
Why so? I have the impression every developer relies on it for a lot of local stuff, and many libraries also use it locally in users' machines for persistence needs.
What do you think it "should be" and is missing?
I think they have this perspective because SQLite is almost never mentioned when DBs are discussed. Almost everyone pushes Postgres, which I think is often a bit insane. Postgres or (more rarely) MySQL is given as the answer to most DB questions, even when they come from clearly new devs.

It's not often that the dev needs any of the large amount of setup/permissions required for Postgres/MySQL. SQLite comes as part of Python's standard library, and reaching for it should be second nature for most devs. I feel the same way: SQLite should be the default answer, and one should only reach for Postgres or MySQL when you really *have* to.
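The zero-setup point in practice: everything below is the standard library, with no server, users, or permissions to configure (the file name is arbitrary).

```python
import sqlite3

# One stdlib import and a file path is the entire "installation".
conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
print(conn.execute("SELECT body FROM notes").fetchall())
```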
SQLite is a file format with a SQL engine, which makes it a great embedded database.
It is not an RDBMS in the true sense, so in my opinion it is correctly rated.
I tried using SQLite to solve problems which would normally require an RDBMS and ran into so many issues: the lack of enforced static types, of built-in date or decimal types (and fast aggregations on dates), of concurrent writes, etc. It's only when you've gone through this process that you realize SQLite is not the database you think it is; and SQLite itself is up front about that: https://www.sqlite.org/aff_short.html
Enforced static types are now available as of SQLite 3.37.0, which came out last week: https://www.sqlite.org/stricttables.html
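A quick illustration of what STRICT buys you, assuming your sqlite3 module is linked against SQLite 3.37.0 or newer (check sqlite3.sqlite_version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# STRICT tables reject values that can't be stored as the declared type.
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY, price REAL) STRICT")
conn.execute("INSERT INTO m (price) VALUES (?)", (1.5,))       # accepted
try:
    conn.execute("INSERT INTO m (price) VALUES (?)", ("oops",))
except sqlite3.IntegrityError as exc:                          # rejected
    print(exc)  # e.g. "cannot store TEXT value in REAL column m.price"
```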
Unfortunately, SQLite was developed by a religious extremist for use on warships. (Try to reconcile those two facts...)
Then one day a better embedded DB comes along and people forget about SQLite.
Maybe. But we learn from the sheer hard work put into this level of testing. And having this level of testing makes you put a bit more effort into each new line.
The bar has been raised. If someone is to make a better embedded DB, they will have to maintain it at least as well as this.
(PS I feel embarrassed by my own testing failures when reading this).
Since SQLite is in every iOS and every Android device along with the Python standard library, I don't think it will be forgotten quickly.
That DB had better inherit those test cases, though; they are incredible assets.
It can't. Most of that test code is proprietary to the SQLite team and not available to others.
Such as?