Strings and encodings
String
In Java, String is a reference type, and it is itself a class. The Java compiler gives String special treatment, however: you can write "..." directly to represent a string:
java
String s1 = "Hello!";
In fact, the string is represented by a char[] array inside String. It is therefore also possible to write it as follows:
java
String s2 = new String(new char[] {'H', 'e', 'l', 'l', 'o', '!'});
Because String is so commonly used, Java provides the "..." string literal syntax.
An important feature of Java strings is that they are immutable. This immutability is achieved through the internal private final char[] field and the absence of any method that modifies that char[].
Let's look at an example:
java
public class Main {
    public static void main(String[] args) {
        String s = "Hello";
        System.out.println(s);
        s = s.toUpperCase();
        System.out.println(s);
    }
}
Based on the output of the above code, try to explain whether the string content has changed.
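One way to check the answer is to keep a second reference to the original string before reassigning s. The sketch below is only an illustration; the helper name originalUnchanged is made up for this example and is not part of the original code:

```java
public class Main {
    // Hypothetical helper: true if toUpperCase() left the original object untouched.
    static boolean originalUnchanged() {
        String s = "Hello";
        String t = s;          // second reference to the same String object
        s = s.toUpperCase();   // s now points to a NEW String, "HELLO"
        return t.equals("Hello") && s.equals("HELLO") && s != t;
    }

    public static void main(String[] args) {
        System.out.println(originalUnchanged()); // true
    }
}
```

The variable s changed which object it points to, but the content of the original "Hello" object did not change.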
String comparison
When comparing two strings, what we almost always want to know is whether their contents are the same. For that, use the equals() method instead of ==.
Let's look at the following example:
java
public class Main {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "hello";
        System.out.println(s1 == s2);
        System.out.println(s1.equals(s2));
    }
}
On the surface, comparing the two strings with == and with equals() both yield true. In fact, this is only because the Java compiler places identical string literals into the constant pool as a single object at compile time, so s1 and s2 naturally hold the same reference.
It is therefore purely coincidental that this == comparison returns true. Written differently, the == comparison fails:
java
public class Main {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "HELLO".toLowerCase();
        System.out.println(s1 == s2);
        System.out.println(s1.equals(s2));
    }
}
Conclusion: to compare the contents of two strings, always use the equals() method.
To compare while ignoring case, use the equalsIgnoreCase() method.
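As a small illustration of the difference between the two methods, consider the sketch below. The helper name sameIgnoringCase is an assumption made for this example only:

```java
public class Main {
    // Hypothetical helper wrapping equalsIgnoreCase().
    static boolean sameIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "HELLO";
        System.out.println(s1.equals(s2));            // false: the case differs
        System.out.println(sameIgnoringCase(s1, s2)); // true
    }
}
```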
The String class also provides a variety of methods to search for and extract substrings. Commonly used methods are:
java
// Does it contain the substring?
"Hello".contains("ll"); // true
Note that the parameter of the contains() method is CharSequence rather than String. CharSequence is an interface that String implements, so a String can be passed directly.
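Because the parameter is a CharSequence, other implementations of that interface, such as StringBuilder, should also be accepted. A minimal sketch, with the helper name containsSequence invented for this example:

```java
public class Main {
    // Hypothetical helper: accepts any CharSequence, not only String.
    static boolean containsSequence(String text, CharSequence part) {
        return text.contains(part);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("ll"); // StringBuilder implements CharSequence
        System.out.println(containsSequence("Hello", sb));   // true
        System.out.println(containsSequence("Hello", "ll")); // true
    }
}
```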
More examples of searching for substrings:
java
"Hello".indexOf("l"); // 2
"Hello".lastIndexOf("l"); // 3
"Hello".startsWith("He"); // true
"Hello".endsWith("lo"); // true
Example of extracting substrings:
java
"Hello".substring(2); // "llo"
"Hello".substring(2, 4); // "ll"
Note that index numbers start from 0, and the end index passed to substring() is exclusive.
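Searching and extracting are often combined. The sketch below uses lastIndexOf() and substring() together to take the text after the last dot; the helper extensionOf and the sample file names are assumptions made for illustration:

```java
public class Main {
    // Hypothetical helper: returns the text after the last ".", or "" if there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf(".");
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.tar.gz")); // gz
        System.out.println(extensionOf("README"));        // prints an empty line
    }
}
```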
Remove leading and trailing whitespace characters
Use the trim() method to remove leading and trailing whitespace characters from a string. Whitespace characters include the space, \t, \r, and \n:
java
" \tHello\r\n ".trim(); // "Hello"
Note: trim() does not change the content of the string, but returns a new string.
The strip() method (available since Java 11) also removes leading and trailing whitespace characters from a string. The difference from trim() is that strip() also removes Unicode whitespace such as the full-width space \u3000 used in Chinese text:
java
"\u3000Hello\u3000".strip(); // "Hello"
" Hello ".stripLeading(); // "Hello "
" Hello ".stripTrailing(); // " Hello"
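To make the difference between trim() and strip() visible, the following sketch compares the resulting lengths. It assumes Java 11 or later, and the helper names are made up for this example:

```java
public class Main {
    // Hypothetical helpers that report the length after each kind of stripping.
    static int lengthAfterTrim(String s) {
        return s.trim().length();
    }

    static int lengthAfterStrip(String s) {
        return s.strip().length();
    }

    public static void main(String[] args) {
        String s = "\u3000Hello\u3000";
        System.out.println(lengthAfterTrim(s));  // 7: trim() leaves \u3000 in place
        System.out.println(lengthAfterStrip(s)); // 5: strip() removes \u3000
    }
}
```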
String also provides isEmpty() and isBlank() to determine whether a string is empty or blank:
java
"".isEmpty(); // true, because the string length is 0
" ".isEmpty(); // false, because the string length is not 0
" \n".isBlank(); // true, because it contains only whitespace characters
" Hello ".isBlank(); // false, because it contains non-whitespace characters
Replace substring
To replace a substring in a string, there are two ways. One is to replace based on characters or strings:
java
String s = "hello";
s.replace('l', 'w'); // "hewwo", all characters 'l' are replaced with 'w'
s.replace("ll", "~~"); // "he~~o", all substrings "ll" are replaced with "~~"
The other is through regular expression replacement:
java
String s = "A,,B;C ,D";
s.replaceAll("[\\,\\;\\s]+", ","); // "A,B,C,D"
The above code uses a regular expression to replace every matching substring with ",". We will explain the usage of regular expressions in detail later.
Split string
To split a string, use the split() method and pass in a regular expression:
java
String s = "A,B,C,D";
String[] ss = s.split("\\,"); // {"A", "B", "C", "D"}
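Since split() takes a regular expression, one pattern can cover several delimiters at once. The sketch below is an illustration only; the helper name splitFields and the chosen pattern are assumptions, not part of the original text:

```java
import java.util.Arrays;

public class Main {
    // Hypothetical helper: splits on any run of commas, semicolons, or whitespace.
    static String[] splitFields(String s) {
        return s.split("[,;\\s]+");
    }

    public static void main(String[] args) {
        String[] parts = splitFields("A,,B;C ,D");
        System.out.println(Arrays.toString(parts)); // [A, B, C, D]
    }
}
```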
Concatenate strings
To concatenate strings, use the static method join(), which joins the elements of a string array with the specified separator:
java
String[] arr = {"A", "B", "C"};
String s = String.join("***", arr); // "A***B***C"
Format string
String provides the formatted() instance method (available since Java 15) and the format() static method. Both take additional arguments, substitute them for the placeholders, and return a new string:
java
public class Main {
    public static void main(String[] args) {
        String s = "Hi %s, your score is %d!";
        System.out.println(s.formatted("Alice", 80));
        System.out.println(String.format("Hi %s, your score is %.2f!", "Bob", 59.5));
    }
}
Pass one argument for each placeholder, and make sure each argument's type matches its placeholder. We often use this approach to format messages. Commonly used placeholders are:
%s: displays a string;
%d: displays an integer;
%x: displays a hexadecimal integer;
%f: displays a floating-point number.
Placeholders can also carry formatting options; for example, %.2f displays two decimal places. If you are not sure which placeholder to use, use %s, because %s can display any data type. For the complete formatting syntax, refer to the JDK documentation.
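The placeholders above can be combined in a single format string. In the sketch below, the helper name describe and the sample values are assumptions for illustration; Locale.ROOT is passed so that the decimal separator does not depend on the machine's default locale:

```java
import java.util.Locale;

public class Main {
    // Hypothetical helper combining %s, %d, %x and %.2f in one format string.
    static String describe(String name, int value, double ratio) {
        return String.format(Locale.ROOT, "%s: %d (hex %x), ratio %.2f",
                name, value, value, ratio);
    }

    public static void main(String[] args) {
        System.out.println(describe("mask", 255, 0.5)); // mask: 255 (hex ff), ratio 0.50
    }
}
```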
Type conversion
To convert any primitive or reference type to a string, use the static method valueOf(). This is an overloaded method, and the compiler automatically chooses the appropriate overload based on the argument:
java
String.valueOf(123); // "123"
String.valueOf(45.67); // "45.67"
String.valueOf(true); // "true"
String.valueOf(new Object()); // Similar to java.lang.Object@636be97c
Converting a string to another type depends on the target type. For example, to convert a string to int:
java
int n1 = Integer.parseInt("123"); // 123
int n2 = Integer.parseInt("ff", 16); // parse as hexadecimal: 255
To convert a string to boolean:
java
boolean b1 = Boolean.parseBoolean("true"); // true
boolean b2 = Boolean.parseBoolean("FALSE"); // false
Pay special attention to the fact that Integer has a getInteger(String) method. It does not convert the string itself to int; instead, it looks up the system property with that name and converts the property's value to Integer:
java
Integer.getInteger("java.version"); // e.g. 11 if the property value is exactly "11"; null if it is not a plain integer
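The difference between parseInt() and getInteger() is easier to see with a property we set ourselves. In this sketch the property name demo.port is hypothetical and exists only for the example:

```java
public class Main {
    // Hypothetical helper: sets a system property, then reads it back as an Integer.
    static Integer readPort() {
        System.setProperty("demo.port", "8080"); // "demo.port" is a made-up property name
        return Integer.getInteger("demo.port");
    }

    public static void main(String[] args) {
        System.out.println(readPort());                  // 8080, the value of the property
        System.out.println(Integer.getInteger("8080"));  // null: there is no property named "8080"
        System.out.println(Integer.parseInt("8080"));    // 8080, parsed from the string itself
    }
}
```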
Convert to char[]
String and char[] can be converted to each other as follows:
java
char[] cs = "Hello".toCharArray(); // String -> char[]
String s = new String(cs); // char[] -> String
If the char[] array is modified, the String does not change:
java
// String <-> char[]
public class Main {
    public static void main(String[] args) {
        char[] cs = "Hello".toCharArray();
        String s = new String(cs);
        System.out.println(s);
        cs[0] = 'X';
        System.out.println(s);
    }
}
This is because a new String instance created through new String(char[]) does not directly reference the incoming char[] array, but makes a copy of it. Modifying the external char[] array therefore does not affect the char[] array inside the String instance, because they are two different arrays.
The immutability design of String shows that if an incoming object may change, we need to copy it instead of referencing it directly.
For example, the following code designs a Score class to store the scores of a group of students:
java
// int[]
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] scores = new int[] { 88, 77, 51, 66 };
        Score s = new Score(scores);
        s.printScores();
        scores[2] = 99;
        s.printScores();
    }
}

class Score {
    private int[] scores;

    public Score(int[] scores) {
        this.scores = scores;
    }

    public void printScores() {
        System.out.println(Arrays.toString(scores));
    }
}
Observe the two outputs. Because Score directly references the int[] array passed in from outside, external code that modifies the array also changes the field of the Score instance. If the external code cannot be trusted, this creates a security risk.
Please fix the constructor of Score so that modifications to the array by external code do not affect the int[] field of the Score instance.
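One possible fix, shown only as a sketch, is to copy the array in the constructor, in the same way new String(char[]) copies its argument. The describe() method is a hypothetical addition that makes the result easy to inspect:

```java
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] scores = new int[] { 88, 77, 51, 66 };
        Score s = new Score(scores);
        s.printScores();
        scores[2] = 99;  // modifies only the caller's array
        s.printScores(); // still [88, 77, 51, 66]
    }
}

class Score {
    private final int[] scores;

    public Score(int[] scores) {
        // Defensive copy: keep a private clone instead of the caller's array.
        this.scores = scores.clone();
    }

    // Hypothetical helper returning the text that printScores() prints.
    public String describe() {
        return Arrays.toString(scores);
    }

    public void printScores() {
        System.out.println(describe());
    }
}
```

Arrays.copyOf(scores, scores.length) would work equally well here; clone() is simply the shortest form for a primitive array.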
Character encoding
In early computer systems, the American National Standards Institute (ANSI) developed an encoding for English letters, digits, and common symbols. Each character occupies one byte, the encoding range is 0 to 127, and the highest bit is always 0. This is called the ASCII encoding. For example, the encoding of the character 'A' is 0x41 and the encoding of the character '1' is 0x31.
One byte is obviously not enough to bring Chinese characters into computer encoding. The GB2312 standard uses two bytes to represent a Chinese character, and the highest bit of the first byte is always 1 to distinguish it from ASCII. For example, the GB2312 encoding of the Chinese character '中' is 0xd6d0.
Similarly, Japanese has the Shift_JIS encoding and Korean has the EUC-KR encoding. Because these encodings follow no common standard, conflicts occur when they are used at the same time.
To unify the encoding of all the world's languages, the Unicode Consortium published the Unicode encoding, which brings the world's major languages into a single encoding so that Chinese, Japanese, Korean, and other languages no longer conflict.
Unicode encoding requires two or more bytes per character. We can compare the encodings of English and Chinese characters in ASCII, GB2312, and Unicode:
ASCII encoding and Unicode encoding of the English character 'A':
┌────┐
ASCII: │ 41 │
└────┘
┌────┬────┐
Unicode: │ 00 │ 41 │
└────┴────┘
The Unicode encoding of an English character simply adds a 00 byte in front.
GB2312 encoding and Unicode encoding of the Chinese character '中':
┌────┬────┐
GB2312: │ d6 │ d0 │
└────┴────┘
┌────┬────┐
Unicode: │ 4e │ 2d │
└────┴────┘
So what is the UTF-8 encoding we often use? Because the high byte of the Unicode encoding of English characters is always 00, text containing a large amount of English wastes space. UTF-8 therefore appeared: a variable-length encoding that converts the fixed-length Unicode encoding into 1 to 4 bytes per character. In UTF-8, the English character 'A' is encoded as 0x41, exactly the same as in ASCII, while the Chinese character '中' is encoded as the 3 bytes 0xe4b8ad.
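The byte values quoted above can be checked from Java. The sketch below is an illustration only; the hex() helper is made up for this example, and UTF_16BE is used here to show the two-byte Unicode form:

```java
import java.nio.charset.StandardCharsets;

public class Main {
    // Hypothetical helper: renders a byte array as lowercase hex digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character '中'
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));      // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE)));   // 0041
        System.out.println(hex(zhong.getBytes(StandardCharsets.UTF_16BE))); // 4e2d
        System.out.println(hex(zhong.getBytes(StandardCharsets.UTF_8)));    // e4b8ad
    }
}
```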
Another benefit of UTF-8 encoding is fault tolerance. If some bytes are corrupted during transmission, the characters that follow are not affected, because UTF-8 uses the high bits of each character's first byte to indicate how many bytes the character occupies. For this reason it is often used as a transmission encoding.
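How the leading bits determine the length can be sketched as follows. The helper sequenceLength is hypothetical and only inspects the bit pattern of a single byte; it does not validate a whole byte sequence:

```java
public class Main {
    // Hypothetical helper: number of bytes in the UTF-8 sequence that starts
    // with this byte, or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte first) {
        int b = first & 0xff;
        if (b < 0x80) return 1;           // 0xxxxxxx: single byte, same as ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx continuation byte, or invalid
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1, the byte for 'A'
        System.out.println(sequenceLength((byte) 0xE4)); // 3, first byte of e4 b8 ad
        System.out.println(sequenceLength((byte) 0xB8)); // -1, a continuation byte
    }
}
```

A decoder that lands in the middle of a damaged character can skip continuation bytes until it finds the next byte that starts a sequence, which is why later characters survive.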
In Java, a char is actually a two-byte Unicode (UTF-16) code unit. If we want to manually convert a string to another encoding, we can do this:
java
byte[] b1 = "Hello".getBytes(); // Convert using the system default encoding, not recommended
byte[] b2 = "Hello".getBytes("UTF-8"); // Convert to UTF-8; may throw UnsupportedEncodingException
byte[] b3 = "Hello".getBytes("GBK"); // Convert to GBK; may throw UnsupportedEncodingException
byte[] b4 = "Hello".getBytes(StandardCharsets.UTF_8); // Convert to UTF-8
Note: after the conversion, the result is no longer of type char, but an array of type byte.
To convert a byte[] in a known encoding to a String, you can do this:
java
byte[] b = ...
String s1 = new String(b, "GBK"); // Decode as GBK
String s2 = new String(b, StandardCharsets.UTF_8); // Decode as UTF-8
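The reason the encoding must be known is that decoding with the wrong one silently produces garbage. A minimal sketch, with the helper names and the sample text chosen for this example only:

```java
import java.nio.charset.StandardCharsets;

public class Main {
    // Hypothetical helper: encode to UTF-8 and decode with the same charset.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: encode to UTF-8 but decode with the wrong charset.
    static String decodeWrongly(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, 6 bytes in UTF-8
        System.out.println(roundTrip(text).equals(text)); // true
        System.out.println(decodeWrongly(text).length()); // 6: each byte became one char
    }
}
```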
Always remember: Java's String and char are always represented in Unicode encoding in memory.
Further reading
Different JDK versions optimize the in-memory representation of the String class differently. Specifically, String in early JDK versions is always stored as a char[], defined as follows:
java
public final class String {
    private final char[] value;
    private final int offset;
    private final int count;
}
String in newer JDK versions is stored as a byte[]: if a String contains only Latin-1 characters (which include ASCII), each character is stored in one byte; otherwise, each character is stored in two bytes. The purpose is to save memory, because the many short String objects in a typical program often contain only ASCII characters:
java
public final class String {
    private final byte[] value;
    private final byte coder; // 0 = LATIN1, 1 = UTF16
}
For users, String's internal optimization does not affect any existing code, because its public method signatures remain unchanged.
Summary
Java's String is an immutable object;
String operations do not change the content of the original string, but return a new string;
Commonly used string operations: extracting substrings, searching, replacing, case conversion, etc.;
Java uses Unicode encoding to represent String and char;
Converting encodings means converting between String and byte[], and the encoding must be specified;
When converting to byte[], always prefer UTF-8 encoding.