Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 11 additions & 2 deletions src/string_bytes.cc
Original file line number Diff line number Diff line change
Expand Up @@ -266,8 +266,17 @@ size_t StringBytes::Write(Isolate* isolate,

case BUFFER:
case UTF8:
nbytes = str->WriteUtf8V2(
isolate, buf, buflen, String::WriteFlags::kReplaceInvalidUtf8);
if (input_view.is_one_byte()) {
// Use simdutf for one-byte strings instead of V8's WriteUtf8V2.
nbytes = simdutf::convert_latin1_to_utf8_safe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the fast case where v8 string is already flattened and is a one byte string won't WriteOneByteV2 be just as fast as using simdutf here? it would be nice to see if there is any difference between them at least and maybe it would be better to use the v8 implementation? It will just end up being a memcpy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't a pure copy case Latin1 characters in the 0x80–0xFF range expand to 2-byte UTF-8 sequences, so WriteOneByteV2 (which is essentially a memcpy of raw bytes) wouldn't produce valid UTF-8 output. The gain here comes from bypassing V8's generic WriteUtf8V2 (which handles both one-byte and two-byte strings with extra checks) and going directly to simdutf's optimized Latin1 → UTF-8 conversion for the one-byte fast path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll have to take a look at the v8 impl - you are probably correct. should be able to do some comparative testing/benchmarking later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mertcanaltin - yeah, you are right about SeqOneByteString - Latin1 strings will be encoded internally in v8 as one byte but anything in range 0x80-0xFF will need to be expanded for UTF-8. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review!

reinterpret_cast<const char*>(input_view.data8()),
input_view.length(),
buf,
buflen);
} else {
nbytes = str->WriteUtf8V2(
isolate, buf, buflen, String::WriteFlags::kReplaceInvalidUtf8);
}
break;

case UCS2: {
Expand Down
Loading