
Strings and encodings

String

In Java, String is a reference type; it is itself a class. The Java compiler gives String special treatment, however: a string can be written directly as a "..." literal:

java
String s1 = "Hello!";

Conceptually, the characters are stored in a char[] array inside String, so the same string can also be created as follows:

java
String s2 = new String(new char[] {'H', 'e', 'l', 'l', 'o', '!'});

Because String is so commonly used, Java provides the "..." string literal syntax.

An important feature of Java strings is that they are immutable. This immutability is achieved through the private final char[] field inside String and the absence of any method that modifies it.

Let's look at an example:

java
public class Main {
    public static void main(String[] args) {
        String s = "Hello";
        System.out.println(s);
        s = s.toUpperCase();
        System.out.println(s);
    }
}

Based on the output of the above code, try to explain whether the string content has changed.
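
One way to check this is to keep a second reference to the original string. The sketch below is only an illustration (the class and method names are made up here): it suggests that toUpperCase() returns a new String while the original object keeps its content.

```java
class ImmutabilityDemo {
    // Returns the original string and the result of toUpperCase() side by side.
    static String[] originalAndUpper(String original) {
        String upper = original.toUpperCase(); // a new String object
        return new String[] { original, upper };
    }

    public static void main(String[] args) {
        String s = "Hello";
        String t = s;          // second reference to the same object
        s = s.toUpperCase();   // s now refers to a different object
        System.out.println(t); // Hello
        System.out.println(s); // HELLO
    }
}
```

The variable s was reassigned, but the string "Hello" itself was never modified.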

String comparison

When comparing two strings, we almost always want to know whether their contents are the same. For that, the equals() method must be used instead of ==.

Let's look at the following example:

java
public class Main {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "hello";
        System.out.println(s1 == s2);
        System.out.println(s1.equals(s2));
    }
}

On the surface, both == and equals() return true for the two strings. In fact, this is only because the Java compiler places identical string literals into the constant pool at compile time as a single object, so s1 and s2 naturally refer to the same object.

So it is purely a coincidence that == returns true here. Written differently, the == comparison returns false:

java
public class Main {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "HELLO".toLowerCase();
        System.out.println(s1 == s2);
        System.out.println(s1.equals(s2));
    }
}

Conclusion: to compare the contents of two strings, always use the equals() method.

To compare while ignoring case, use the equalsIgnoreCase() method.
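
As a small sketch of the difference (the class and helper names are illustrative only), equals() treats "hello" and "HELLO" as different, while equalsIgnoreCase() treats them as equal:

```java
class CompareDemo {
    // Content comparison that ignores upper/lower case differences.
    static boolean sameIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "HELLO";
        System.out.println(s1.equals(s2));            // false
        System.out.println(sameIgnoringCase(s1, s2)); // true
    }
}
```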

The String class also provides a variety of methods to search for and extract substrings. Commonly used methods are:

java
// Does the string contain the substring?
"Hello".contains("ll"); // true

Note that the parameter of contains() is a CharSequence rather than a String. String arguments are accepted because String implements the CharSequence interface.
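
Because the parameter type is CharSequence, other implementations such as StringBuilder can be passed as well. A minimal sketch, with an illustrative helper name:

```java
class ContainsDemo {
    // contains() accepts any CharSequence, for example a String or a StringBuilder.
    static boolean containsSequence(String text, CharSequence part) {
        return text.contains(part);
    }

    public static void main(String[] args) {
        CharSequence fromBuilder = new StringBuilder("ll");
        System.out.println(containsSequence("Hello", fromBuilder)); // true
        System.out.println(containsSequence("Hello", "xy"));        // false
    }
}
```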

More examples of searching for substrings:

java
"Hello".indexOf("l"); // 2
"Hello".lastIndexOf("l"); // 3
"Hello".startsWith("He"); // true
"Hello".endsWith("lo"); // true

Example of extracting substrings:

java
"Hello".substring(2); // "llo"
"Hello".substring(2, 4); // "ll"

Note that index numbers start from 0, and that the end index of substring() is exclusive.
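
Searching and extracting are often combined. The following sketch (the helper after() is hypothetical, not part of String) uses indexOf(), which returns -1 when the substring is absent, to decide where substring() should start:

```java
class SubstringDemo {
    // Returns the text after the first occurrence of sep, or "" if sep is absent.
    static String after(String s, String sep) {
        int i = s.indexOf(sep); // -1 when sep does not occur in s
        if (i < 0) {
            return "";
        }
        return s.substring(i + sep.length());
    }

    public static void main(String[] args) {
        System.out.println(after("key=value", "="));  // value
        System.out.println(after("novalue", "="));    // (empty line)
        System.out.println("Hello".substring(1, 3));  // el
    }
}
```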

Remove leading and trailing whitespace characters

Use the trim() method to remove leading and trailing whitespace characters from a string. Whitespace characters include space, \t, \r and \n:

java
"  \tHello\r\n ".trim(); // "Hello"

Note: trim() does not change the content of the string, but returns a new string.

The strip() method also removes leading and trailing whitespace characters. Unlike trim(), it also removes the full-width (ideographic) space \u3000 used in Chinese text:

java
"\u3000Hello\u3000".strip(); // "Hello"
" Hello ".stripLeading(); // "Hello "
" Hello ".stripTrailing(); // " Hello"

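To make the difference between the two methods visible, this sketch compares string lengths after trim() and strip() on the same input. The class and method names are illustrative, and strip() is assumed to be available (Java 11 or later).

```java
class StripDemo {
    // "Hello" surrounded by one ideographic space (U+3000) on each side.
    static final String PADDED = "\u3000Hello\u3000";

    static int lengthAfterTrim(String s) {
        return s.trim().length();
    }

    static int lengthAfterStrip(String s) {
        return s.strip().length();
    }

    public static void main(String[] args) {
        System.out.println(lengthAfterTrim(PADDED));  // 7: trim() keeps U+3000
        System.out.println(lengthAfterStrip(PADDED)); // 5: strip() removes it
    }
}
```
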
String also provides isEmpty() and isBlank() to test whether a string is empty or blank, respectively:

java
"".isEmpty(); // true, because the string length is 0
"  ".isEmpty(); // false, because the string length is not 0
"  \n".isBlank(); // true, because it contains only whitespace characters
" Hello ".isBlank(); // false, because it contains non-whitespace characters

Replace substring

To replace a substring in a string, there are two ways. One is to replace based on characters or strings:

java
String s = "hello";
s.replace('l', 'w'); // "hewwo", all 'l' characters are replaced with 'w'
s.replace("ll", "~~"); // "he~~o", all "ll" substrings are replaced with "~~"

The other is through regular expression replacement:

java
String s = "A,,B;C ,D";
s.replaceAll("[\\,\\;\\s]+", ","); // "A,B,C,D"

The code above uses a regular expression to replace every matching run of separators with ",". Regular expressions are explained in detail later.

Split string

To split a string, use the split() method and pass in a regular expression:

java
String s = "A,B,C,D";
String[] ss = s.split("\\,"); // {"A", "B", "C", "D"}
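
One detail worth knowing is that split() with a single argument drops trailing empty strings. The sketch below (helper names are illustrative) contrasts this with the two-argument overload, where a negative limit keeps them:

```java
import java.util.Arrays;

class SplitDemo {
    // Default split: trailing empty strings are removed from the result.
    static String splitDefault(String s) {
        return Arrays.toString(s.split(","));
    }

    // A negative limit keeps trailing empty strings.
    static String splitKeepEmpty(String s) {
        return Arrays.toString(s.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(splitDefault("A,B,,"));   // [A, B]
        System.out.println(splitKeepEmpty("A,B,,")); // [A, B, , ]
    }
}
```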

Concatenate strings

To concatenate strings, use the static method join(), which joins the elements of a string array with the specified delimiter:

java
String[] arr = {"A", "B", "C"};
String s = String.join("***", arr); // "A***B***C"
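
String.join() also has an overload that accepts an Iterable, so a List can be joined the same way. A brief sketch, assuming Java 9 or later for List.of() and using an illustrative helper name:

```java
import java.util.List;

class JoinDemo {
    // Joins the items of a list with a comma.
    static String csv(List<String> items) {
        return String.join(",", items);
    }

    public static void main(String[] args) {
        System.out.println(csv(List.of("A", "B", "C"))); // A,B,C
    }
}
```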

Format string

String provides the formatted() instance method and the static format() method. Both take additional arguments, substitute them for the placeholders, and return a new string:

java
public class Main {
    public static void main(String[] args) {
        String s = "Hi %s, your score is %d!";
        System.out.println(s.formatted("Alice", 80));
        System.out.println(String.format("Hi %s, your score is %.2f!", "Bob", 59.5));
    }
}

Pass as many arguments as there are placeholders, and make sure the type of each argument matches its placeholder. We often use this method to format information. Commonly used placeholders are:

  • %s: displays a string;
  • %d: displays an integer;
  • %x: displays an integer in hexadecimal;
  • %f: displays a floating point number.

Placeholders can also carry formatting options; for example, %.2f displays two decimal places. If you are not sure which placeholder to use, use %s, because %s can display any data type. For the complete formatting syntax, refer to the JDK documentation.
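
The placeholders can be combined in one format string. This sketch (describe() is an illustrative helper) prints the same integer with %s, %x and a zero-padded %08x:

```java
class FormatDemo {
    // Shows one integer as text, as hexadecimal, and as zero-padded hexadecimal.
    static String describe(int n) {
        return String.format("%s in hex is %x (padded: %08x)", n, n, n);
    }

    public static void main(String[] args) {
        System.out.println(describe(255)); // 255 in hex is ff (padded: 000000ff)
    }
}
```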

Type conversion

To convert any primitive or reference type to a string, use the static method valueOf() . This is an overloaded method, and the compiler will automatically choose the appropriate method based on the parameters:

java
String.valueOf(123); // "123"
String.valueOf(45.67); // "45.67"
String.valueOf(true); // "true"
String.valueOf(new Object()); // Similar to java.lang.Object@636be97c

How to convert a string to another type depends on the target type. For example, to convert a string to int:

java
int n1 = Integer.parseInt("123"); // 123
int n2 = Integer.parseInt("ff", 16); // Parse as hexadecimal: 255
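
Integer.parseInt() throws a NumberFormatException when the string is not a valid number. One possible way to handle that, sketched here with a hypothetical helper parseOrDefault(), is to fall back to a default value:

```java
class ParseDemo {
    // Parses s as a decimal int, returning fallback if s is not a valid number.
    static int parseOrDefault(String s, int fallback) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("123", 0));  // 123
        System.out.println(parseOrDefault("12a", -1)); // -1
    }
}
```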

Convert string to boolean type:

java
boolean b1 = Boolean.parseBoolean("true"); // true
boolean b2 = Boolean.parseBoolean("FALSE"); // false
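
Unlike Integer.parseInt(), Boolean.parseBoolean() never throws: it returns true only when the string equals "true" ignoring case, and false for everything else. A short sketch with an illustrative class name:

```java
class BooleanDemo {
    static boolean parse(String s) {
        return Boolean.parseBoolean(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("True")); // true
        System.out.println(parse("yes"));  // false: not the word "true"
        System.out.println(parse(null));   // false
    }
}
```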

Pay special attention: Integer has a getInteger(String) method, which does not parse the string as an int. Instead, it looks up the system property with that name and converts the property's value to an Integer:

java
Integer.getInteger("java.version"); // e.g. 11 if the property value is "11"; null if the value is not a valid integer
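
The difference from parseInt() is easier to see with a property that we set ourselves. In this sketch the property name demo.port is made up for illustration:

```java
class GetIntegerDemo {
    // Looks up a system property by name and converts its value to an Integer.
    static Integer lookup(String propertyName) {
        return Integer.getInteger(propertyName);
    }

    public static void main(String[] args) {
        System.setProperty("demo.port", "8080");     // hypothetical property
        System.out.println(lookup("demo.port"));     // 8080
        System.out.println(lookup("123"));           // null: no property named "123"
        System.out.println(Integer.parseInt("123")); // 123
    }
}
```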

Convert to char[]

String and char[] types can be converted to each other by:

java
char[] cs = "Hello".toCharArray(); // String -> char[]
String s = new String(cs); // char[] -> String

If the char[] array is modified, String will not change:

java
// String <-> char[]
public class Main {
    public static void main(String[] args) {
        char[] cs = "Hello".toCharArray();
        String s = new String(cs);
        System.out.println(s);
        cs[0] = 'X';
        System.out.println(s);
    }
}

This is because a String instance created through new String(char[]) does not reference the incoming char[] array directly; it makes a copy. Modifying the external char[] array therefore does not affect the char[] array inside the String instance, because they are two different arrays.

It can be seen from the immutability design of String that if the incoming object may change, we need to copy it instead of directly referencing it.

For example, the following code designs a Score class to store the scores of a group of students:

java
// int[]
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] scores = new int[] { 88, 77, 51, 66 };
        Score s = new Score(scores);
        s.printScores();
        scores[2] = 99;
        s.printScores();
    }
}

class Score {
    private int[] scores;
    public Score(int[] scores) {
        this.scores = scores;
    }

    public void printScores() {
        System.out.println(Arrays.toString(scores));
    }
}

Observe the two outputs. Because Score directly references the int[] array passed in from outside, external code can modify the array and thereby change the state of the Score instance. If the external code cannot be trusted, this is a security risk.

Please fix the constructor of Score so that modifications to the array by external code do not affect the int[] field of Score instance.
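
One possible fix, shown only as a sketch, is to copy the array in the constructor in the same way new String(char[]) does. The class is named SafeScore here to keep it apart from the original Score, and describe() is an illustrative helper:

```java
import java.util.Arrays;

class SafeScore {
    private final int[] scores;

    SafeScore(int[] scores) {
        // Defensive copy: later changes to the caller's array are not visible here.
        this.scores = scores.clone();
    }

    String describe() {
        return Arrays.toString(scores);
    }

    public static void main(String[] args) {
        int[] scores = { 88, 77, 51, 66 };
        SafeScore s = new SafeScore(scores);
        scores[2] = 99;
        System.out.println(s.describe()); // [88, 77, 51, 66]
    }
}
```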

Character encoding

In early computer systems, the American National Standards Institute (ANSI) developed an encoding for English letters, digits and common symbols. Each character occupies one byte, the encoding range is 0 to 127, and the highest bit is always 0. This is called ASCII encoding. For example, the encoding of the character 'A' is 0x41 and the encoding of the character '1' is 0x31.

To include Chinese characters in a computer encoding, one byte is obviously not enough. The GB2312 standard uses two bytes to represent a Chinese character, and the highest bit of the first byte is always 1 to distinguish it from ASCII. For example, the GB2312 encoding of the Chinese character '中' is 0xd6d0.

Similarly, Japanese has Shift_JIS encoding and Korean has EUC-KR encoding. Because these codes are not unified in standard, conflicts will occur if they are used at the same time.

To unify the encoding of all languages around the world, the Unicode Consortium released the Unicode encoding, which brings the world's major languages into a single encoding so that Chinese, Japanese, Korean and other languages no longer conflict.

Unicode encoding requires two or more bytes per character. We can compare how Chinese and English characters are encoded in ASCII, GB2312 and Unicode:

ASCII encoding and Unicode encoding of English character 'A' :

         ┌────┐
ASCII:   │ 41 │
         └────┘
         ┌────┬────┐
Unicode: │ 00 │ 41 │
         └────┴────┘

The Unicode encoding of English characters is simply to add a 00 byte in front.

GB2312 encoding and Unicode encoding of the Chinese character '中' :

         ┌────┬────┐
GB2312:  │ d6 │ d0 │
         └────┴────┘
         ┌────┬────┐
Unicode: │ 4e │ 2d │
         └────┴────┘

So what is the UTF-8 encoding we often use? Because the high byte of the Unicode encoding of English characters is always 00, text that is mostly English wastes space. UTF-8 was introduced to address this: it is a variable-length encoding that converts the fixed-length Unicode encoding into a sequence of 1 to 4 bytes. In UTF-8, the English character 'A' is encoded as 0x41, exactly the same as in ASCII, while the Chinese character '中' is encoded as the 3 bytes 0xe4b8ad.
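
These byte values can be checked from Java. The sketch below prints the encoded bytes as hexadecimal; the hex() helper is illustrative, and UTF-16BE is used here to show the two-byte form from the diagrams above:

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats bytes as lowercase hex separated by spaces, e.g. "e4 b8 ad".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the Chinese character from the text
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));      // 41
        System.out.println(hex(zhong.getBytes(StandardCharsets.UTF_8)));    // e4 b8 ad
        System.out.println(hex(zhong.getBytes(StandardCharsets.UTF_16BE))); // 4e 2d
    }
}
```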

Another benefit of UTF-8 is fault tolerance. If some bytes are corrupted in transmission, the characters that follow are not affected, because the high-order bits of each byte indicate whether it starts a character and how many bytes that character occupies. For this reason, UTF-8 is often used as a transmission encoding.

In Java, the char type is actually a two-byte (16-bit) Unicode value, that is, a UTF-16 code unit. If we want to manually convert a string to another encoding, we can do this:
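
The way the leading bits announce the length of a character can be sketched as follows. This is a simplified illustration with made-up names; it only inspects the bit pattern and does not perform full UTF-8 validation:

```java
class Utf8LengthDemo {
    // Returns the sequence length announced by a UTF-8 lead byte,
    // 0 for a continuation byte, or -1 if the pattern is not recognized.
    static int sequenceLength(byte b) {
        int v = b & 0xff;
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx: single byte (ASCII)
        if ((v & 0xC0) == 0x80) return 0; // 10xxxxxx: continuation byte
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx: starts a 2-byte sequence
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx: starts a 3-byte sequence
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx: starts a 4-byte sequence
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xe4)); // 3
        System.out.println(sequenceLength((byte) 0xb8)); // 0
    }
}
```

A decoder that lands in the middle of a damaged character can skip continuation bytes until it reaches the next lead byte and resume from there.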

java
byte[] b1 = "Hello".getBytes(); // Convert using the platform default encoding, not recommended
byte[] b2 = "Hello".getBytes("UTF-8"); // Convert to UTF-8 (this overload throws the checked UnsupportedEncodingException)
byte[] b3 = "Hello".getBytes("GBK"); // Convert to GBK
byte[] b4 = "Hello".getBytes(StandardCharsets.UTF_8); // Convert to UTF-8 (needs import java.nio.charset.StandardCharsets)

Note: After the encoding is converted, it is no longer a char type, but an array represented by the byte type.

If you want to convert a byte[] in a known encoding to a String, you can do this:

java
byte[] b = ...
String s1 = new String(b, "GBK"); // Decode as GBK
String s2 = new String(b, StandardCharsets.UTF_8); // Decode as UTF-8
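
Decoding only works if the same encoding is used in both directions. This sketch (method names are illustrative) round-trips a string through UTF-8 and then decodes the same bytes as ISO-8859-1 to show what a mismatch does:

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Encode and decode with the same charset: the text survives.
    static String roundTrip(String text) {
        byte[] b = text.getBytes(StandardCharsets.UTF_8);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes with the wrong charset: each byte becomes one char.
    static String misdecoded(String text) {
        byte[] b = text.getBytes(StandardCharsets.UTF_8);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d";
        System.out.println(roundTrip(zhong).equals(zhong)); // true
        System.out.println(misdecoded(zhong).length());     // 3: garbled text
    }
}
```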

Always remember: Java's String and char are always represented in Unicode encoding in memory.
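
One consequence of the two-byte char is that characters outside the Basic Multilingual Plane occupy two char values. As a brief sketch with illustrative method names, length() counts char values while codePointCount() counts Unicode characters:

```java
class CodePointDemo {
    static int charCount(String s) {
        return s.length();
    }

    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(charCount(emoji));      // 2
        System.out.println(codePointCount(emoji)); // 1
    }
}
```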

Further reading

The in-memory representation of String differs between JDK versions. In early JDK versions, a String is always stored as a char[], defined as follows:

java
public final class String {
    private final char[] value;
    private final int offset;
    private final int count;
}

In newer JDK versions, a String is stored as a byte[]: if the String contains only Latin-1 characters (which include ASCII), each character is stored in one byte; otherwise, each character is stored in two bytes. The purpose is to save memory, because the large number of short String objects in a typical program often contain only ASCII characters:

java
public final class String {
    private final byte[] value;
    private final byte coder; // 0 = LATIN1, 1 = UTF16
}

For users, the internal optimization of String does not affect any existing code, because its public method signatures remain unchanged.

Summary

A Java String is an immutable object;

String operations do not change the content of the original string, but return a new string;

Commonly used string operations: extracting substrings, searching, replacing, case conversion, etc.;

Java uses Unicode encoding to represent String and char;

Converting between encodings means converting between String and byte[], and the encoding must be specified;

When converting to byte[] , UTF-8 encoding is always preferred.
