Comment by xyzzyz

8 hours ago

As a matter of fact, you cannot do

  let s = "asd";
  println!("{}", s[0]);

You will get a compiler error telling you that you cannot index into &str.

Right, you have to give it a usize range. And that will index by bytes. This:

  fn main() {
      let s = "12345";
      println!("{}", &s[0..1]);
  }

compiles and prints out "1".

This:

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", &s[0..1]);
  }

compiles and panics with the following error:

  byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`
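
If you'd rather not panic, here is a minimal sketch of the checked alternative. `str::get` and `str::is_char_boundary` are standard library methods; the helper name `byte_prefix` is made up for illustration.

```rust
/// Returns the first `n` bytes of `s` as a &str, or None if byte `n`
/// falls inside a multi-byte character or past the end of the string.
/// (`byte_prefix` is a hypothetical helper, not a std function.)
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    // `get` does the same boundary check as `&s[0..n]`, but returns
    // None instead of panicking when the check fails.
    s.get(0..n)
}

fn main() {
    let s = "\u{1234}2345";
    // U+1234 takes 3 bytes in UTF-8, so byte offsets 1 and 2 are not boundaries.
    println!("{:?}", byte_prefix(s, 1)); // None
    println!("{:?}", byte_prefix(s, 3)); // Some(..), byte 3 is a boundary
    println!("{}", s.is_char_boundary(1)); // false
    println!("{}", s.is_char_boundary(3)); // true
}
```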

To get the nth char (a Unicode scalar value; nth is zero-based, so this prints the second char, '2'):

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().nth(1).unwrap());
  }

To get a substring by char positions (this allocates a new String):

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().skip(0).take(1).collect::<String>());
  }
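
If the allocation bothers you, one possible sketch is to turn char positions into byte offsets with `char_indices` and then slice. The helper name `char_substring` and its clamping behaviour are my own choices here, not anything from std.

```rust
/// Borrowed substring covering chars [start, start + len), counted in
/// Unicode scalar values. Out-of-range positions are clamped to the end
/// of the string. (`char_substring` is a hypothetical helper.)
fn char_substring(s: &str, start: usize, len: usize) -> &str {
    // Byte offset of the `start`-th char, or the end of the string
    // if there are fewer than `start + 1` chars.
    let begin = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let rest = &s[begin..];
    // Byte offset (within `rest`) of the char just past the substring.
    let end = rest.char_indices().nth(len).map_or(rest.len(), |(i, _)| i);
    &rest[..end]
}

fn main() {
    let s = "\u{1234}2345";
    println!("{}", char_substring(s, 0, 1)); // the 3-byte char U+1234
    println!("{}", char_substring(s, 1, 2)); // 23
}
```

Both offsets come from `char_indices`, so they always sit on char boundaries and the slicing cannot hit the panic above.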

To actually get the bytes you'd have to call #as_bytes; the resulting byte slice works with both scalar and range indices, e.g.:

  fn main() {
      let s = "\u{1234}2345";
      println!("{:02X?}", &s.as_bytes()[0..1]);
      println!("{:02X}", &s.as_bytes()[0]);
  }

IMO it's less intuitive than it should be but still less bad than e.g. Go's two types of nil because it will fail in a visible manner.

  • It's actually somewhat hard to hit that panic in a realistic scenario, because you are unlikely to be using slice indices that are not on a character boundary. Where would you even get them from? The standard library's search functions all return byte indices on a character boundary. For example, if you want to slice the string between the first occurrence of 'a' and the first occurrence of 'z', you'll write something like

      let start = s.find('a')?;
      let end = s.find('z')?;
      let sub = &s[start..end];
    

    and it will never panic on a char boundary, because find never returns an offset that's not on a char boundary. (It can still panic if 'z' comes before 'a', since the range is then reversed.)
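
    A self-contained sketch of that snippet, in case anyone wants to try it. The function name `between` is made up, and I swapped the final `&s[start..end]` for `s.get(start..end)` so a reversed range gives None rather than a panic:

    ```rust
    /// Slice from the first `from` up to (not including) the first `to`.
    /// (`between` is a hypothetical helper for illustration.)
    fn between(s: &str, from: char, to: char) -> Option<&str> {
        // `find` returns byte offsets, always on char boundaries.
        let start = s.find(from)?;
        let end = s.find(to)?;
        // `get` returns None if start > end instead of panicking.
        s.get(start..end)
    }

    fn main() {
        // 'a' sits at byte 3 because U+1234 occupies bytes 0..3.
        println!("{:?}", between("\u{1234}a23z5", 'a', 'z')); // Some("a23")
        println!("{:?}", between("zebra", 'a', 'z')); // None, 'z' comes first
    }
    ```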

    •   Where would you even get them from?
      

      In my case it was in parsing text where a numeric value had a two-character prefix but a string value did not. So I was matching on the byte range 0..2, which blew up occasionally depending on the string values (whenever byte 2 landed inside a multi-byte character). There are perhaps smarter ways to do this (e.g. splitn on a space, a regex, a giant if-else statement, etc.) but this seemed at first glance to be the most efficient way because it all fit neatly into a match statement.
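
      For what it's worth, a rough sketch of how that match could stay a match without the panic. The `n=` prefix and the `Value` enum are invented for the example (my real format was different); the only std piece doing the work is `str::get`.

      ```rust
      #[derive(Debug, PartialEq)]
      enum Value<'a> {
          Number(i64),
          Text(&'a str),
      }

      /// Hypothetical format: numbers look like "n=42", anything else is text.
      fn parse_value(s: &str) -> Value<'_> {
          // `get` yields None (instead of panicking) when byte 2 is inside
          // a multi-byte char or the string is shorter than 2 bytes.
          match s.get(0..2) {
              // Byte 2 is known to be a boundary here, so `s[2..]` is safe.
              Some("n=") => match s[2..].parse() {
                  Ok(n) => Value::Number(n),
                  Err(_) => Value::Text(s),
              },
              _ => Value::Text(s),
          }
      }

      fn main() {
          println!("{:?}", parse_value("n=42")); // Number(42)
          println!("{:?}", parse_value("\u{1234}2345")); // falls through to Text
      }
      ```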

      The inverse was also a problem: laying out text with a monospace font knowing that every character took up the same number of pixels along the x-axis (e.g. no odd emoji or whatever else). Gotta make sure to call #count on #chars instead of #len on the string itself, as some of the text (Windows-1251 encoded) got converted into multi-byte UTF-8 sequences.
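
      A tiny sketch of the difference, assuming one column per char (which held for my Cyrillic text but would not for wide or combining characters). The helper name `columns` is made up.

      ```rust
      /// Column count for a monospace layout, assuming every Unicode
      /// scalar value occupies exactly one column.
      /// (`columns` is a hypothetical helper.)
      fn columns(s: &str) -> usize {
          s.chars().count()
      }

      fn main() {
          let s = "Привет"; // each Cyrillic letter is 2 bytes in UTF-8
          println!("{} bytes, {} columns", s.len(), columns(s)); // 12 bytes, 6 columns
      }
      ```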