<!DOCTYPE html>
<html lang="en" data-content_root="./" >
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>NEP 55 — Add a UTF-8 variable-width string DType to NumPy — NumPy Enhancement Proposals</title>
<script data-cfasync="false">
document.documentElement.dataset.mode = localStorage.getItem("mode") || "";
document.documentElement.dataset.theme = localStorage.getItem("theme") || "";
</script>
<!--
this give us a css class that will be invisible only if js is disabled
-->
<noscript>
<style>
.pst-js-only { display: none !important; }
</style>
</noscript>
<!-- Loaded before other Sphinx assets -->
<link href="_static/styles/theme.css?digest=8878045cc6db502f8baf" rel="stylesheet" />
<link href="_static/styles/pydata-sphinx-theme.css?digest=8878045cc6db502f8baf" rel="stylesheet" />
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=03e43079" />
<!-- So that users can add custom icons -->
<script src="_static/scripts/fontawesome.js?digest=8878045cc6db502f8baf"></script>
<!-- Pre-loaded scripts that we'll load fully later -->
<link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=8878045cc6db502f8baf" />
<link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=8878045cc6db502f8baf" />
<script src="_static/documentation_options.js?v=7f41d439"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<script>DOCUMENTATION_OPTIONS.pagename = 'nep-0055-string_dtype';</script>
<link rel="icon" href="_static/favicon.ico"/>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="NEP 56 — Array API standard support in NumPy’s main namespace" href="nep-0056-array-api-main-namespace.html" />
<link rel="prev" title="NEP 52 — Python API cleanup for NumPy 2.0" href="nep-0052-python-api-cleanup.html" />
<meta name="docsearch:language" content="en"/>
<meta name="docsearch:version" content="" />
<meta name="docbuild:last-update" content="Jan 10, 2025"/>
</head>
<body data-bs-spy="scroll" data-bs-target=".bd-toc-nav" data-offset="180" data-bs-root-margin="0px 0px -60%" data-default-mode="">
<div id="pst-skip-link" class="skip-link d-print-none"><a href="#main-content">Skip to main content</a></div>
<div id="pst-scroll-pixel-helper"></div>
<button type="button" class="btn rounded-pill" id="pst-back-to-top">
<i class="fa-solid fa-arrow-up"></i>Back to top</button>
<dialog id="pst-search-dialog">
<form class="bd-search d-flex align-items-center"
action="search.html"
method="get">
<i class="fa-solid fa-magnifying-glass"></i>
<input type="search"
class="form-control"
name="q"
placeholder="Search the docs ..."
aria-label="Search the docs ..."
autocomplete="off"
autocorrect="off"
autocapitalize="off"
spellcheck="false"/>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd>K</kbd></span>
</form>
</dialog>
<div class="pst-async-banner-revealer d-none">
<aside id="bd-header-version-warning" class="d-none d-print-none" aria-label="Version warning"></aside>
</div>
<header class="bd-header navbar navbar-expand-lg bd-navbar d-print-none">
<div class="bd-header__inner bd-page-width">
<button class="pst-navbar-icon sidebar-toggle primary-toggle" aria-label="Site navigation">
<span class="fa-solid fa-bars"></span>
</button>
<div class="col-lg-3 navbar-header-items__start">
<div class="navbar-item">
<a class="navbar-brand logo" href="content.html">
<img src="_static/numpylogo.svg" class="logo__image only-light" alt="NumPy Enhancement Proposals - Home"/>
<img src="_static/numpylogo.svg" class="logo__image only-dark pst-js-only" alt="NumPy Enhancement Proposals - Home"/>
</a></div>
</div>
<div class="col-lg-9 navbar-header-items">
<div class="me-auto navbar-header-items__center">
<div class="navbar-item">
<nav>
<ul class="bd-navbar-elements navbar-nav">
<li class="nav-item current active">
<a class="nav-link nav-internal" href="index.html">
Index
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-internal" href="scope.html">
The Scope of NumPy
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-internal" href="roadmap.html">
Current roadmap
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-external" href="https://github.com/numpy/numpy/issues?q=is%3Aopen+is%3Aissue+label%3A%2223+-+Wish+List%22">
Wish list
</a>
</li>
</ul>
</nav></div>
</div>
<div class="navbar-header-items__end">
<div class="navbar-item navbar-persistent--container">
<button class="btn search-button-field search-button__button pst-js-only" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="fa-solid fa-magnifying-glass"></i>
<span class="search-button__default-text">Search</span>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd class="kbd-shortcut__modifier">K</kbd></span>
</button>
</div>
<div class="navbar-item">
<button class="btn btn-sm nav-link pst-navbar-icon theme-switch-button pst-js-only" aria-label="Color mode" data-bs-title="Color mode" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="theme-switch fa-solid fa-sun fa-lg" data-mode="light" title="Light"></i>
<i class="theme-switch fa-solid fa-moon fa-lg" data-mode="dark" title="Dark"></i>
<i class="theme-switch fa-solid fa-circle-half-stroke fa-lg" data-mode="auto" title="System Settings"></i>
</button></div>
<div class="navbar-item"><ul class="navbar-icon-links"
aria-label="Icon Links">
<li class="nav-item">
<a href="https://github.com/numpy/numpy" title="GitHub" class="nav-link pst-navbar-icon" rel="noopener" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom"><i class="fa-brands fa-square-github fa-lg" aria-hidden="true"></i>
<span class="sr-only">GitHub</span></a>
</li>
</ul></div>
</div>
</div>
<div class="navbar-persistent--mobile">
<button class="btn search-button-field search-button__button pst-js-only" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="fa-solid fa-magnifying-glass"></i>
<span class="search-button__default-text">Search</span>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd class="kbd-shortcut__modifier">K</kbd></span>
</button>
</div>
<button class="pst-navbar-icon sidebar-toggle secondary-toggle" aria-label="On this page">
<span class="fa-solid fa-outdent"></span>
</button>
</div>
</header>
<div class="bd-container">
<div class="bd-container__inner bd-page-width">
<dialog id="pst-primary-sidebar-modal"></dialog>
<div id="pst-primary-sidebar" class="bd-sidebar-primary bd-sidebar">
<div class="sidebar-header-items sidebar-primary__section">
<div class="sidebar-header-items__center">
<div class="navbar-item">
<nav>
<ul class="bd-navbar-elements navbar-nav">
<li class="nav-item current active">
<a class="nav-link nav-internal" href="index.html">
Index
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-internal" href="scope.html">
The Scope of NumPy
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-internal" href="roadmap.html">
Current roadmap
</a>
</li>
<li class="nav-item ">
<a class="nav-link nav-external" href="https://github.com/numpy/numpy/issues?q=is%3Aopen+is%3Aissue+label%3A%2223+-+Wish+List%22">
Wish list
</a>
</li>
</ul>
</nav></div>
</div>
<div class="sidebar-header-items__end">
<div class="navbar-item">
<button class="btn btn-sm nav-link pst-navbar-icon theme-switch-button pst-js-only" aria-label="Color mode" data-bs-title="Color mode" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="theme-switch fa-solid fa-sun fa-lg" data-mode="light" title="Light"></i>
<i class="theme-switch fa-solid fa-moon fa-lg" data-mode="dark" title="Dark"></i>
<i class="theme-switch fa-solid fa-circle-half-stroke fa-lg" data-mode="auto" title="System Settings"></i>
</button></div>
<div class="navbar-item"><ul class="navbar-icon-links"
aria-label="Icon Links">
<li class="nav-item">
<a href="https://github.com/numpy/numpy" title="GitHub" class="nav-link pst-navbar-icon" rel="noopener" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom"><i class="fa-brands fa-square-github fa-lg" aria-hidden="true"></i>
<span class="sr-only">GitHub</span></a>
</li>
</ul></div>
</div>
</div>
<div class="sidebar-primary-items__start sidebar-primary__section">
<div class="sidebar-primary-item">
<nav class="bd-docs-nav bd-links"
aria-label="Section Navigation">
<p class="bd-links__title" role="heading" aria-level="1">Section Navigation</p>
<div class="bd-toc-item navbar-nav"><ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="scope.html">The Scope of NumPy</a></li>
<li class="toctree-l1"><a class="reference internal" href="roadmap.html">Current roadmap</a></li>
<li class="toctree-l1"><a class="reference external" href="https://github.com/numpy/numpy/issues?q=is%3Aopen+is%3Aissue+label%3A%2223+-+Wish+List%22">Wish list</a></li>
</ul>
<ul class="current nav bd-sidenav">
<li class="toctree-l1 has-children"><a class="reference internal" href="meta.html">Meta-NEPs (NEPs about NEPs or active Processes)</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="nep-0000.html">NEP 0 — Purpose and process</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0023-backwards-compatibility.html">NEP 23 — Backwards compatibility and deprecation policy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0036-fair-play.html">NEP 36 — Fair play</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0045-c_style_guide.html">NEP 45 — C style guide</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0046-sponsorship-guidelines.html">NEP 46 — NumPy sponsorship guidelines</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0048-spending-project-funds.html">NEP 48 — Spending NumPy project funds</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-template.html">NEP X — Template and instructions</a></li>
</ul>
</details></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="provisional.html">Provisional NEPs (provisionally accepted; interface may change)</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul class="simple">
</ul>
</details></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="accepted.html">Accepted NEPs (implementation in progress)</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="nep-0041-improved-dtype-support.html">NEP 41 — First step towards a new datatype system</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0042-new-dtypes.html">NEP 42 — New and extensible DTypes</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0044-restructuring-numpy-docs.html">NEP 44 — Restructuring the NumPy documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0051-scalar-representation.html">NEP 51 — Changing the representation of NumPy scalars</a></li>
</ul>
</details></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="open.html">Open NEPs (under consideration)</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="nep-0043-extensible-ufuncs.html">NEP 43 — Enhancing the extensibility of UFuncs</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0053-c-abi-evolution.html">NEP 53 — Evolving the NumPy C-API for NumPy 2.0</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0054-simd-cpp-highway.html">NEP 54 — SIMD infrastructure evolution: adopting Google Highway when moving to C++?</a></li>
</ul>
</details></li>
<li class="toctree-l1 current active has-children"><a class="reference internal" href="finished.html">Finished NEPs</a><details open="open"><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="nep-0001-npy-format.html">NEP 1 — A simple file format for NumPy arrays</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0005-generalized-ufuncs.html">NEP 5 — Generalized universal functions</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0007-datetime-proposal.html">NEP 7 — A proposal for implementing some date/time types in NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0010-new-iterator-ufunc.html">NEP 10 — Optimizing iterator/UFunc performance</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0013-ufunc-overrides.html">NEP 13 — A mechanism for overriding Ufuncs</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0014-dropping-python2.7-proposal.html">NEP 14 — Plan for dropping Python 2.7 support</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0015-merge-multiarray-umath.html">NEP 15 — Merging multiarray and umath</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0018-array-function-protocol.html">NEP 18 — A dispatch mechanism for NumPy's high level array functions</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0019-rng-policy.html">NEP 19 — Random number generator policy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0020-gufunc-signature-enhancement.html">NEP 20 — Expansion of generalized universal function signatures</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0022-ndarray-duck-typing-overview.html">NEP 22 — Duck typing for NumPy arrays – high level overview</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0027-zero-rank-arrarys.html">NEP 27 — Zero rank arrays</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0028-website-redesign.html">NEP 28 — numpy.org website redesign</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0029-deprecation_policy.html">NEP 29 — Recommend Python and NumPy version support as a community policy standard</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0032-remove-financial-functions.html">NEP 32 — Remove the financial functions from NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0034-infer-dtype-is-object.html">NEP 34 — Disallow inferring ``dtype=object`` from sequences</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0035-array-creation-dispatch-with-array-function.html">NEP 35 — Array creation dispatching with __array_function__</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0038-SIMD-optimizations.html">NEP 38 — Using SIMD optimization instructions for performance</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0040-legacy-datatype-impl.html">NEP 40 — Legacy datatype implementation in NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0049.html">NEP 49 — Data allocation strategies</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0050-scalar-promotion.html">NEP 50 — Promotion rules for Python scalars</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0052-python-api-cleanup.html">NEP 52 — Python API cleanup for NumPy 2.0</a></li>
<li class="toctree-l2 current active"><a class="current reference internal" href="#">NEP 55 — Add a UTF-8 variable-width string DType to NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0056-array-api-main-namespace.html">NEP 56 — Array API standard support in NumPy's main namespace</a></li>
</ul>
</details></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="deferred.html">Deferred and Superseded NEPs</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="nep-0002-warnfix.html">NEP 2 — A proposal to build numpy without warning with a big set of warning flags</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0003-math_config_clean.html">NEP 3 — Cleaning the math configuration of numpy.core</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0004-datetime-proposal3.html">NEP 4 — A (third) proposal for implementing some date/time types in NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0006-newbugtracker.html">NEP 6 — Replacing Trac with a different bug tracker</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0008-groupby_additions.html">NEP 8 — A proposal for adding groupby functionality to NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0009-structured_array_extensions.html">NEP 9 — Structured array extensions</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0011-deferred-ufunc-evaluation.html">NEP 11 — Deferred UFunc evaluation</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0012-missing-data.html">NEP 12 — Missing data functionality in NumPy</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0021-advanced-indexing.html">NEP 21 — Simplified and explicit advanced indexing</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0024-missing-data-2.html">NEP 24 — Missing data functionality - alternative 1 to NEP 12</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0025-missing-data-3.html">NEP 25 — NA support via special dtypes</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0026-missing-data-summary.html">NEP 26 — Summary of missing data NEPs and discussion</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0030-duck-array-protocol.html">NEP 30 — Duck typing for NumPy arrays - implementation</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0031-uarray.html">NEP 31 — Context-local and global overrides of the NumPy API</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0037-array-module.html">NEP 37 — A dispatch protocol for NumPy-like modules</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0047-array-api-standard.html">NEP 47 — Adopting the array API standard</a></li>
</ul>
</details></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="rejected.html">Rejected and Withdrawn NEPs</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="nep-0016-abstract-array.html">NEP 16 — An abstract base class for identifying "duck arrays"</a></li>
<li class="toctree-l2"><a class="reference internal" href="nep-0017-split-out-maskedarray.html">NEP 17 — Split out masked arrays</a></li>
</ul>
</details></li>
</ul>
</div>
</nav></div>
</div>
<div class="sidebar-primary-items__end sidebar-primary__section">
<div class="sidebar-primary-item">
<div id="ethical-ad-placement"
class="flat"
data-ea-publisher="readthedocs"
data-ea-type="readthedocs-sidebar"
data-ea-manual="true">
</div></div>
</div>
</div>
<main id="main-content" class="bd-main" role="main">
<div class="bd-content">
<div class="bd-article-container">
<div class="bd-header-article d-print-none">
<div class="header-article-items header-article__inner">
<div class="header-article-items__start">
<div class="header-article-item">
<nav aria-label="Breadcrumb" class="d-print-none">
<ul class="bd-breadcrumbs">
<li class="breadcrumb-item breadcrumb-home">
<a href="content.html" class="nav-link" aria-label="Home">
<i class="fa-solid fa-home"></i>
</a>
</li>
<li class="breadcrumb-item"><a href="index.html" class="nav-link">Roadmap &amp; NumPy enhancement proposals</a></li>
<li class="breadcrumb-item"><a href="finished.html" class="nav-link">Finished NEPs</a></li>
<li class="breadcrumb-item active" aria-current="page"><span class="ellipsis">NEP 55 — Add a UTF-8 variable-width string DType to NumPy</span></li>
</ul>
</nav>
</div>
</div>
</div>
</div>
<div id="searchbox"></div>
<article class="bd-article">
<section id="nep-55-add-a-utf-8-variable-width-string-dtype-to-numpy">
<span id="nep55"></span><h1>NEP 55 — Add a UTF-8 variable-width string DType to NumPy<a class="headerlink" href="#nep-55-add-a-utf-8-variable-width-string-dtype-to-numpy" title="Link to this heading">#</a></h1>
<dl class="field-list simple">
<dt class="field-odd">Author<span class="colon">:</span></dt>
<dd class="field-odd"><p>Nathan Goldbaum &lt;<a class="reference external" href="mailto:ngoldbaum%40quansight.com">ngoldbaum<span>@</span>quansight<span>.</span>com</a>&gt;</p>
</dd>
<dt class="field-even">Author<span class="colon">:</span></dt>
<dd class="field-even"><p>Warren Weckesser</p>
</dd>
<dt class="field-odd">Author<span class="colon">:</span></dt>
<dd class="field-odd"><p>Marten van Kerkwijk</p>
</dd>
<dt class="field-even">Status<span class="colon">:</span></dt>
<dd class="field-even"><p>Final</p>
</dd>
<dt class="field-odd">Type<span class="colon">:</span></dt>
<dd class="field-odd"><p>Standards Track</p>
</dd>
<dt class="field-even">Created<span class="colon">:</span></dt>
<dd class="field-even"><p>2023-06-29</p>
</dd>
<dt class="field-odd">Updated<span class="colon">:</span></dt>
<dd class="field-odd"><p>2024-01-18</p>
</dd>
<dt class="field-even">Resolution<span class="colon">:</span></dt>
<dd class="field-even"><p><a class="reference external" href="https://mail.python.org/archives/list/numpy-discussion@python.org/thread/Y5CIKBZKMIOWSRYLJ64WV6DKM37QR76B/">https://mail.python.org/archives/list/numpy-discussion@python.org/thread/Y5CIKBZKMIOWSRYLJ64WV6DKM37QR76B/</a></p>
</dd>
</dl>
<section id="abstract">
<h2>Abstract<a class="headerlink" href="#abstract" title="Link to this heading">#</a></h2>
<p>We propose adding a new string data type to NumPy where each item in the array
is an arbitrary-length UTF-8 encoded string. This will enable performance,
memory usage, and usability improvements for NumPy users, including:</p>
<ul class="simple">
<li><p>Memory savings for workflows that currently use fixed-width strings and store
primarily ASCII data or a mix of short and long strings in a single NumPy
array.</p></li>
<li><p>Downstream libraries and users will be able to move away from object arrays
currently used as a substitute for variable-length string arrays, unlocking
performance improvements by avoiding passes over the data outside of NumPy and
allowing use of fast GIL-releasing C casts and string ufuncs for string
operations.</p></li>
<li><p>A more intuitive user-facing API for working with arrays of Python strings,
without a need to think about the in-memory array representation.</p></li>
</ul>
</section>
<section id="motivation-and-scope">
<h2>Motivation and scope<a class="headerlink" href="#motivation-and-scope" title="Link to this heading">#</a></h2>
<p>First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the most recent major
discussion about this topic. Finally, we will describe the scope of the proposed
changes to NumPy as well as changes that are explicitly out of scope of this
proposal.</p>
<section id="history-of-string-support-in-numpy">
<h3>History of string support in NumPy<a class="headerlink" href="#history-of-string-support-in-numpy" title="Link to this heading">#</a></h3>
<p>Support in NumPy for textual data evolved organically in response to early user
needs and then changes in the Python ecosystem.</p>
<p>Support for strings was added to NumPy to support users of the NumArray
<code class="docutils literal notranslate"><span class="pre">chararray</span></code> type. Remnants of this are still visible in the NumPy API:
string-related functionality lives in <code class="docutils literal notranslate"><span class="pre">np.char</span></code>, to support the
<code class="docutils literal notranslate"><span class="pre">np.char.chararray</span></code> class. This class is not formally deprecated, but since
NumPy 1.4 its module docstring has carried a comment suggesting the use of
string dtypes instead.</p>
<p>NumPy’s <code class="docutils literal notranslate"><span class="pre">bytes_</span></code> DType was originally used to represent the Python 2 <code class="docutils literal notranslate"><span class="pre">str</span></code>
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, because it ignores trailing nulls, the <code class="docutils literal notranslate"><span class="pre">bytes_</span></code> DType is
only suitable for fixed-width bytestreams that do not contain trailing nulls, so
it can be a problematic match for generic bytestreams where trailing nulls
need to round-trip through a NumPy string.</p>
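<p>The trailing-null behavior described above can be seen in a minimal sketch. The
array contents are illustrative and are not taken from the NEP. The raw buffer keeps
every byte, but the value read back from the array has lost its trailing nulls:</p>

```python
import numpy as np

# Illustrative 5-byte value whose last two bytes are nulls.
original = b"abc\x00\x00"
arr = np.array([original], dtype="S5")

# The underlying array buffer still holds all five bytes...
raw = arr.tobytes()

# ...but reading the element back strips the trailing nulls.
round_tripped = bytes(arr[0])

print(raw)            # b'abc\x00\x00'
print(round_tripped)  # b'abc'
```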
<p>The <code class="docutils literal notranslate"><span class="pre">unicode</span></code> DType was added to support the Python 2 <code class="docutils literal notranslate"><span class="pre">unicode</span></code> type. It
stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which makes for
a straightforward implementation, but is inefficient for storing text that can
be represented well using a one-byte ASCII or Latin-1 encoding. This was not a
problem in Python 2, where ASCII or mostly-ASCII text could use the <code class="docutils literal notranslate"><span class="pre">str</span></code>
DType.</p>
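<p>As a rough illustration of the storage cost described above, the following sketch
compares the per-element size of the fixed-width unicode DType with the UTF-8 encoded
size of the same text. The sample string is an assumption chosen for illustration:</p>

```python
import numpy as np

text = "hello"           # five ASCII characters (illustrative sample)
arr = np.array([text])   # NumPy infers a fixed-width unicode dtype of width 5

# Each character occupies one 32-bit UCS-4 codepoint in the array buffer.
ucs4_bytes = arr.dtype.itemsize

# The same text needs one byte per character when encoded as UTF-8.
utf8_bytes = len(text.encode("utf-8"))

print(ucs4_bytes)  # 20
print(utf8_bytes)  # 5
```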
<p>With the arrival of Python 3 support in NumPy, the string DTypes were largely
left alone due to backward compatibility concerns, although the unicode DType
became the default DType for <code class="docutils literal notranslate"><span class="pre">str</span></code> data and the old <code class="docutils literal notranslate"><span class="pre">string</span></code> DType was
renamed the <code class="docutils literal notranslate"><span class="pre">bytes_</span></code> DType. This change left NumPy with the sub-optimal
situation of shipping a data type originally intended for null-terminated
bytestrings as the data type for <em>all</em> Python <code class="docutils literal notranslate"><span class="pre">bytes</span></code> data, and a default
string type with an in-memory representation that consumes four times as much
memory as is needed for data that can be represented well by a one-byte
ASCII or Latin-1 encoding.</p>
</section>
<section id="problems-with-fixed-width-strings">
<h3>Problems with fixed-width strings<a class="headerlink" href="#problems-with-fixed-width-strings" title="Link to this heading">#</a></h3>
<p>Both existing string DTypes represent fixed-width sequences, allowing storage of
the string data in the array buffer. This avoids adding out-of-band storage to
NumPy; however, it makes for an awkward user interface for many use cases. In
particular, the maximum string size must be inferred by NumPy or estimated by
the user before loading the data into a NumPy array or selecting an output DType
for string operations. In the worst case, this requires an expensive pass over
the full dataset to calculate the maximum length of an array element. It also
wastes memory when array elements have varying lengths. Pathological cases, where
an array stores many short strings and a few very long strings, are particularly
wasteful.</p>
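<p>A small sketch of the pathological case, using the existing fixed-width unicode DType. The data set is an assumption chosen for illustration; the UTF-8 figure is only the size of the raw text and ignores any per-element bookkeeping a variable-width DType would need.</p>

```python
import numpy as np

# Many short strings plus a single long one (illustrative data).
data = ["a"] * 999 + ["x" * 1000]

# NumPy infers a fixed-width unicode dtype wide enough for the longest string.
arr = np.array(data)

# Every element reserves 1000 code points * 4 bytes, whatever its real length.
fixed_width_bytes = arr.nbytes

# Bytes the text itself would need as UTF-8, without padding.
payload_bytes = sum(len(s.encode("utf-8")) for s in data)

print(arr.dtype.itemsize, fixed_width_bytes, payload_bytes)
```

<p>Here the array buffer is roughly two thousand times larger than the text it holds, because 999 one-character strings are each padded to the width of the single long string.</p>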
<p>Downstream usage of string data in NumPy arrays has proven out the need for a
variable-width string data type. In practice, many downstream libraries avoid
using fixed-width strings due to usability issues and instead employ <code class="docutils literal notranslate"><span class="pre">object</span></code>
arrays for storing strings. In particular, Pandas has explicitly deprecated
support for NumPy fixed-width strings, coerces NumPy fixed-width string arrays
to either <code class="docutils literal notranslate"><span class="pre">object</span></code> string arrays or <code class="docutils literal notranslate"><span class="pre">PyArrow</span></code>-backed string arrays, and in
the future will switch to only supporting string data via <code class="docutils literal notranslate"><span class="pre">PyArrow</span></code>, which has
native support for UTF-8 encoded variable-width string arrays <a class="footnote-reference brackets" href="#id14" id="id1" role="doc-noteref"><span class="fn-bracket">[</span>1<span class="fn-bracket">]</span></a>.</p>
</section>
</section>
<section id="previous-discussions">
<h2>Previous discussions<a class="headerlink" href="#previous-discussions" title="Link to this heading">#</a></h2>
<p>The project last publicly discussed this topic in depth in 2017, when Julian
Taylor proposed a fixed-width text data type parameterized by an encoding
<a class="footnote-reference brackets" href="#id15" id="id2" role="doc-noteref"><span class="fn-bracket">[</span>2<span class="fn-bracket">]</span></a>. This started a wide-ranging discussion about pain points for working with
string data in NumPy and possible ways forward.</p>
<p>The discussion highlighted two use-cases that the current support for strings
does a poor job of handling <a class="footnote-reference brackets" href="#id16" id="id3" role="doc-noteref"><span class="fn-bracket">[</span>3<span class="fn-bracket">]</span></a> <a class="footnote-reference brackets" href="#id17" id="id4" role="doc-noteref"><span class="fn-bracket">[</span>4<span class="fn-bracket">]</span></a> <a class="footnote-reference brackets" href="#id18" id="id5" role="doc-noteref"><span class="fn-bracket">[</span>5<span class="fn-bracket">]</span></a>:</p>
<ul class="simple">
<li><p>Loading or memory-mapping scientific datasets with unknown encoding,</p></li>
<li><p>Working with “a NumPy array of python strings” in a manner that allows
transparent conversion between NumPy arrays and Python strings, including
support for missing strings. The <code class="docutils literal notranslate"><span class="pre">object</span></code> DType partially satisfies this
need, albeit at the cost of slow performance and no type checking.</p></li>
</ul>
<p>As a result of this discussion, improving support for string data was added to
the NumPy project roadmap <a class="footnote-reference brackets" href="#id19" id="id6" role="doc-noteref"><span class="fn-bracket">[</span>6<span class="fn-bracket">]</span></a>, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.</p>
</section>
<section id="proposed-work">
<h2>Proposed work<a class="headerlink" href="#proposed-work" title="Link to this heading">#</a></h2>
<p>This NEP proposes adding <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>, a DType that stores variable-width
heap-allocated strings in NumPy arrays, to replace downstream usages of the
<code class="docutils literal notranslate"><span class="pre">object</span></code> DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In particular,
we propose to:</p>
<ul class="simple">
<li><p>Add a new variable-length string DType to NumPy, targeting NumPy 2.0.</p></li>
<li><p>Work out issues related to adding a DType implemented using the experimental
DType API to NumPy itself.</p></li>
<li><p>Support a user-provided missing data sentinel.</p></li>
<li><p>Expose string ufuncs in a new <code class="docutils literal notranslate"><span class="pre">np.strings</span></code> namespace for functions and
types related to string support, enabling a migration path for a future
deprecation of <code class="docutils literal notranslate"><span class="pre">np.char</span></code>.</p></li>
</ul>
<p>The following is out of scope for this work:</p>
<ul class="simple">
<li><p>Changing DType inference for string data.</p></li>
<li><p>Adding a DType for memory-mapping text in unknown encodings or a DType that
attempts to fix issues with the <code class="docutils literal notranslate"><span class="pre">bytes_</span></code> DType.</p></li>
<li><p>Fully agreeing on the semantics of missing data sentinels or adding a
missing data sentinel to NumPy itself.</p></li>
<li><p>Implementing SIMD optimizations for string operations.</p></li>
<li><p>Updating the <code class="docutils literal notranslate"><span class="pre">npy</span></code> and <code class="docutils literal notranslate"><span class="pre">npz</span></code> file formats to allow storage of
arbitrary-length sidecar data.</p></li>
</ul>
<p>While we’re explicitly ruling out implementing these items as part of this work,
adding a new string DType helps set up future work that does implement some of
these items.</p>
<p>If implemented, this NEP will make it easier to add a new fixed-width text DType
in the future by moving string operations into a long-term supported namespace
and improving the internal infrastructure in NumPy for handling strings. We are
also proposing a memory layout that should be amenable to SIMD optimization in
some cases, increasing the payoff for writing string operations as
SIMD-optimized ufuncs in the future.</p>
<p>While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data sentinel,
so this does move NumPy a little closer to officially supporting missing
data. We are attempting to avoid resolving the disagreement described in
<a class="reference internal" href="nep-0026-missing-data-summary.html#nep26"><span class="std std-ref">NEP 26</span></a> and this proposal does not require or preclude adding a
missing data sentinel or bitflag-based missing data support to <code class="docutils literal notranslate"><span class="pre">ndarray</span></code> in
the future.</p>
</section>
<section id="usage-and-impact">
<h2>Usage and impact<a class="headerlink" href="#usage-and-impact" title="Link to this heading">#</a></h2>
<p>The DType is intended as a drop-in replacement for object string arrays. This
means that we intend to support as many downstream usages of object string
arrays as possible, including all supported NumPy functionality. Pandas is the
obvious first user, and substantial work has already occurred to add support in
a fork of Pandas. <code class="docutils literal notranslate"><span class="pre">scikit-learn</span></code> also uses object string arrays and will be
able to migrate to a DType with guarantees that the arrays contain only
strings. Both h5py <a class="footnote-reference brackets" href="#id20" id="id7" role="doc-noteref"><span class="fn-bracket">[</span>7<span class="fn-bracket">]</span></a> and PyTables <a class="footnote-reference brackets" href="#id21" id="id8" role="doc-noteref"><span class="fn-bracket">[</span>8<span class="fn-bracket">]</span></a> will be able to add first-class
support for variable-width UTF-8 encoded string datasets in HDF5. String data
are heavily used in machine-learning workflows and downstream machine learning
libraries will be able to leverage this new DType.</p>
<p>Users who wish to load string data into NumPy and leverage NumPy features like
advanced (fancy) indexing will have a natural choice that offers substantial
memory savings over fixed-width unicode strings and better validation guarantees
and overall integration with NumPy than object string arrays. Moving to a
first-class string DType also removes the need to acquire the GIL during string
operations, unlocking future optimizations that are impossible with object
string arrays.</p>
<section id="performance">
<h3>Performance<a class="headerlink" href="#performance" title="Link to this heading">#</a></h3>
<p>Here we briefly describe preliminary performance measurements of the prototype
version of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> we have implemented outside of NumPy using the
experimental DType API. All benchmarks in this section were performed on a Dell
XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy,
Pandas, and the <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> prototype were all compiled with meson release
builds.</p>
<p>Currently, the <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> prototype has comparable performance with object
arrays and fixed-width string arrays. One exception is array creation from
Python strings, where performance is somewhat slower than object arrays and comparable
to fixed-width unicode arrays:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>In [1]: from stringdtype import StringDType
In [2]: import numpy as np
In [3]: data = [str(i) * 10 for i in range(100_000)]
In [4]: %timeit arr_object = np.array(data, dtype=object)
3.15 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
8.8 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
11.6 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</pre></div>
</div>
<p>In this example, object DTypes are substantially faster because the objects in
the <code class="docutils literal notranslate"><span class="pre">data</span></code> list can be directly interned in the array, while <code class="docutils literal notranslate"><span class="pre">StrDType</span></code> and
<code class="docutils literal notranslate"><span class="pre">StringDType</span></code> need to copy the string data and <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> needs to
convert the data to UTF-8 and perform additional heap allocations outside the
array buffer. In the future, if Python moves to a UTF-8 internal representation
for strings, the string loading performance of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> should improve.</p>
<p>String operations have similar performance:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
31.6 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit np.char.capitalize(arr_stringdtype)
41.5 ms ± 84.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit np.char.capitalize(arr_strdtype)
47.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</pre></div>
</div>
<p>The poor performance here is a reflection of the slow iterator-based
implementation of operations in <code class="docutils literal notranslate"><span class="pre">np.char</span></code>. When we finish rewriting these
operations as ufuncs, we will unlock substantial performance
improvements. Using the example of the <code class="docutils literal notranslate"><span class="pre">add</span></code> ufunc, which we have implemented
for the <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> prototype:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>In [10]: %timeit arr_object + arr_object
10.1 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit arr_stringdtype + arr_stringdtype
3.64 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
17.7 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</pre></div>
</div>
<p>As described below, we have already updated a fork of Pandas to use a prototype
version of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>. This demonstrates the performance improvements
available when data are already loaded into a NumPy array and are passed to a
third-party library. Currently Pandas attempts to coerce all <code class="docutils literal notranslate"><span class="pre">str</span></code> data to
<code class="docutils literal notranslate"><span class="pre">object</span></code> DType by default, and has to check and sanitize existing <code class="docutils literal notranslate"><span class="pre">object</span></code>
arrays that are passed in. This requires a copy or pass over the data made
unnecessary by first-class support for variable-width strings in both NumPy and
Pandas:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>In [13]: import pandas as pd
In [14]: %timeit pd.Series(arr_stringdtype)
18.8 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
</pre></div>
</div>
<p>If we force Pandas to use object string arrays, which was the default until very
recently, we see the substantial performance penalty of a pass over the data
outside of NumPy:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>In [15]: %timeit pd.Series(arr_object, dtype='string[python]')
907 µs ± 67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
</pre></div>
</div>
<p>Pandas switched to PyArrow-backed string arrays by default specifically to avoid
this and other performance costs associated with object string arrays.</p>
</section>
</section>
<section id="backward-compatibility">
<h2>Backward compatibility<a class="headerlink" href="#backward-compatibility" title="Link to this heading">#</a></h2>
<p>We are not proposing a change to DType inference for python strings and do not
expect to see any impacts on existing usages of NumPy.</p>
</section>
<section id="detailed-description">
<h2>Detailed description<a class="headerlink" href="#detailed-description" title="Link to this heading">#</a></h2>
<p>Here we provide a detailed description of the version of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> we
would like to include in NumPy. This is mostly identical to the prototype, but
has a few differences that are impossible to implement in a DType that lives
outside of NumPy.</p>
<p>First, we describe the Python API for instantiating <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>
instances. Next, we will describe the missing data handling support and support
for strict string type checking for array elements. We next discuss the cast and
ufunc implementations we will define and discuss our plan for a new
<code class="docutils literal notranslate"><span class="pre">np.strings</span></code> namespace to directly expose string ufuncs in the Python
API. Finally, we provide an overview of the C API we would like to expose and
the details of the memory layout and heap allocation strategy we have chosen for
the initial implementation.</p>
<section id="python-api-for-stringdtype">
<h3>Python API for <code class="docutils literal notranslate"><span class="pre">StringDType</span></code><a class="headerlink" href="#python-api-for-stringdtype" title="Link to this heading">#</a></h3>
<p>The new DType will be accessible via the <code class="docutils literal notranslate"><span class="pre">np.dtypes</span></code> namespace:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="kn">from</span><span class="w"> </span><span class="nn">numpy.dtypes</span><span class="w"> </span><span class="kn">import</span> <span class="n">StringDType</span>
<span class="gp">>>> </span><span class="n">dt</span> <span class="o">=</span> <span class="n">StringDType</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">dt</span>
<span class="go">numpy.dtypes.StringDType()</span>
</pre></div>
</div>
<p>In addition, we propose reserving the character <code class="docutils literal notranslate"><span class="pre">"T"</span></code> (short for text) for
usage with <code class="docutils literal notranslate"><span class="pre">np.dtype</span></code>, so the above would be identical to:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">dtype</span><span class="p">(</span><span class="s2">"T"</span><span class="p">)</span>
<span class="go">numpy.dtypes.StringDType()</span>
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">StringDType</span></code> can be used out of the box to represent strings of arbitrary
length in a NumPy array:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"this is a very long string"</span><span class="p">,</span> <span class="s2">"short string"</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">StringDType</span><span class="p">())</span>
<span class="gp">>>> </span><span class="n">arr</span>
<span class="go">array(['this is a very long string', 'short string'], dtype=StringDType())</span>
</pre></div>
</div>
<p>Note that unlike fixed-width strings, <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> is not parameterized by
the maximum length of an array element; arbitrarily long or short strings can
live in the same array without needing to reserve storage for padding bytes in
the short strings.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> class will be a synonym for the default <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>
instance when the class is passed as a <code class="docutils literal notranslate"><span class="pre">dtype</span></code> argument in the NumPy Python
API. We have already converted most of the API surface to work like this, but
there are still a few spots that have not yet been converted and it’s likely
third-party code has not been converted, so we will not emphasize this in the
docs. Emphasizing that <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> is a class and <code class="docutils literal notranslate"><span class="pre">StringDType()</span></code> is an
instance is a more forward-looking API that the rest of the NumPy DType API can
move towards now that DType classes are importable from the <code class="docutils literal notranslate"><span class="pre">np.dtypes</span></code>
namespace, so we will include an explicit instantiation of a <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>
object in the documentation even if it is not strictly necessary.</p>
<p>We propose using the Python <code class="docutils literal notranslate"><span class="pre">str</span></code> builtin as the DType’s scalar type:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">StringDType</span><span class="o">.</span><span class="n">type</span>
<span class="go"><class 'str'></span>
</pre></div>
</div>
<p>While this does create an API wart in that the mapping from builtin DType
classes to scalars in NumPy will no longer be one-to-one (the <code class="docutils literal notranslate"><span class="pre">unicode</span></code>
DType’s scalar type is <code class="docutils literal notranslate"><span class="pre">str</span></code>), this avoids needing to define, optimize, or
maintain a <code class="docutils literal notranslate"><span class="pre">str</span></code> subclass for this purpose or other hacks to maintain this
one-to-one mapping. To maintain backward compatibility, the DType detected for a
list of python strings will remain a fixed-width unicode string.</p>
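<p>The backward-compatible inference behaviour can be observed with any existing NumPy release. This short sketch only shows the current default; the sample strings are reused from the example above and the variable names are illustrative.</p>

```python
import numpy as np

# No dtype is given, so NumPy infers one from the Python strings.
arr = np.array(["this is a very long string", "short string"])

print(arr.dtype.kind)      # "U" marks the fixed-width unicode DType
print(arr.dtype.itemsize)  # 4 bytes per code point times the longest length
```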
<p>As described below, <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> supports two parameters that can adjust the
runtime behavior of the DType. We will not attempt to support parameters for the
dtype via a character code. If users need an instance of the DType that does not
use the default parameters, they will need to instantiate an instance of the
DType using the DType class.</p>
<p>We will also extend the <code class="docutils literal notranslate"><span class="pre">NPY_TYPES</span></code> enum in the C API with an <code class="docutils literal notranslate"><span class="pre">NPY_VSTRING</span></code>
entry (there is already an <code class="docutils literal notranslate"><span class="pre">NPY_STRING</span></code> entry). This should not interfere with
legacy user-defined DTypes since the integer type numbers for these data types
begin at 256. In principle there is still room for hundreds more builtin
DTypes in the integer range available in the <code class="docutils literal notranslate"><span class="pre">NPY_TYPES</span></code> enum.</p>
<p>In principle we do not need to reserve a character code and there is a desire to
move away from character codes. However, a substantial amount of downstream code
relies on checking DType character codes to discriminate between builtin NumPy
DTypes, and we think it would harm adoption to require users to refactor their
DType-handling code if they want to use <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>.</p>
<p>We also hope that in the future we might be able to add a new fixed-width text
version of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> that can re-use the <code class="docutils literal notranslate"><span class="pre">"T"</span></code> character code with
length or encoding modifiers. This will allow a migration to a more flexible
text dtype for use with structured arrays and other use-cases where a fixed-width
string is a better fit than a variable-width string.</p>
</section>
<section id="missing-data-support">
<h3>Missing Data Support<a class="headerlink" href="#missing-data-support" title="Link to this heading">#</a></h3>
<p>Missing data can be represented using a sentinel:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">dt</span> <span class="o">=</span> <span class="n">StringDType</span><span class="p">(</span><span class="n">na_object</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">"hello"</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dt</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">arr</span>
<span class="go">array(['hello', nan, 'world'], dtype=StringDType(na_object=nan))</span>
<span class="gp">>>> </span><span class="n">arr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="go">nan</span>
<span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="go">array([False, True, False])</span>
<span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dt</span><span class="p">)</span>
<span class="go">array(['', '', ''], dtype=StringDType(na_object=nan))</span>
</pre></div>
</div>
<p>We only propose supporting user-provided sentinels. By default, empty arrays
will be populated with empty strings:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">StringDType</span><span class="p">())</span>
<span class="go">array(['', '', ''], dtype=StringDType())</span>
</pre></div>
</div>
<p>By only supporting user-provided missing data sentinels, we avoid resolving
exactly how NumPy itself should support missing data and the correct semantics
of the missing data object, leaving that up to users to decide. However, we <em>do</em>
detect whether the user is providing a NaN-like missing data value, a string
missing data value, or neither. We explain how we handle these cases below.</p>
<p>A cautious reader may be worried about the complexity of needing to handle three
different categories of missing data sentinel. The complexity here is reflective
of the flexibility of object arrays and the downstream usage patterns we’ve
found. Some users want comparisons with the sentinel to error, so they use
<code class="docutils literal notranslate"><span class="pre">None</span></code>. Others want comparisons to succeed and have some kind of meaningful
ordering, so they use some arbitrary, hopefully unique string. Other users want
to use something that acts like NaN in comparisons and arithmetic or is
literally NaN so that NumPy operations that specifically look for exactly NaN
work and there isn’t a need to rewrite missing data handling outside of
NumPy. We believe it is possible to support all this, but it requires a bit of
hopefully manageable complexity.</p>
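<p>To make the three categories concrete, here is a minimal pure-Python sketch of how a sentinel might be classified. It is not NumPy's implementation: <code class="docutils literal notranslate"><span class="pre">classify_sentinel</span></code> is a hypothetical helper, and using self-inequality as the test for "NaN-like" is an assumption made for this example.</p>

```python
def classify_sentinel(na_object):
    """Return "string", "nan-like", or "other" for a candidate sentinel.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    if isinstance(na_object, str):
        return "string"
    try:
        # Assumption: NaN-like objects compare unequal to themselves.
        if na_object != na_object:
            return "nan-like"
    except TypeError:
        # Assumption: a sentinel whose self-comparison has no truth value
        # is also treated as NaN-like.
        return "nan-like"
    return "other"


print(classify_sentinel(float("nan")))
print(classify_sentinel("__nan__"))
print(classify_sentinel(None))
```

<p>Under these assumptions a float NaN falls in the NaN-like category, <code class="docutils literal notranslate"><span class="pre">"__nan__"</span></code> is a string sentinel, and <code class="docutils literal notranslate"><span class="pre">None</span></code> falls into the remaining category whose comparisons raise.</p>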
<section id="nan-like-sentinels">
<h4>NaN-like Sentinels<a class="headerlink" href="#nan-like-sentinels" title="Link to this heading">#</a></h4>
<p>A NaN-like sentinel returns itself as the result of arithmetic operations. This
includes the python <code class="docutils literal notranslate"><span class="pre">nan</span></code> float and the Pandas missing data sentinel
<code class="docutils literal notranslate"><span class="pre">pd.NA</span></code>. We choose to make NaN-like sentinels inherit these behaviors in
operations, so the result of addition is the sentinel:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">dt</span> <span class="o">=</span> <span class="n">StringDType</span><span class="p">(</span><span class="n">na_object</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">"hello"</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dt</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">arr</span> <span class="o">+</span> <span class="n">arr</span>
<span class="go">array(['hellohello', nan, 'worldworld'], dtype=StringDType(na_object=nan))</span>
</pre></div>
</div>
<p>We also chose to make a NaN-like sentinel sort to the end of the array,
following the behavior of sorting an array containing <code class="docutils literal notranslate"><span class="pre">nan</span></code>.</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="go">array(['hello', 'world', nan], dtype=StringDType(na_object=nan))</span>
</pre></div>
</div>
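<p>The same ordering can be mimicked in plain Python, which may help when reasoning about expected results. This is a sketch of the ordering rule only, not of NumPy's sort; <code class="docutils literal notranslate"><span class="pre">is_missing</span></code> is a hypothetical helper.</p>

```python
nan = float("nan")
values = ["world", nan, "hello"]


def is_missing(value):
    # The sentinel here is NaN, which is unequal to itself; strings are not.
    return value != value


# Missing entries get the key (True, "") and therefore sort after every
# (False, text) key produced by a real string.
ordered = sorted(
    values,
    key=lambda v: (is_missing(v), "" if is_missing(v) else v),
)
print(ordered)
```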
</section>
<section id="string-sentinels">
<h4>String Sentinels<a class="headerlink" href="#string-sentinels" title="Link to this heading">#</a></h4>
<p>A string missing data value is an instance of <code class="docutils literal notranslate"><span class="pre">str</span></code> or of a subclass of <code class="docutils literal notranslate"><span class="pre">str</span></code>.</p>
<p>Operations will use the sentinel value directly for missing entries. This is the
primary usage of this pattern we’ve found in downstream code, where a missing
data sentinel like <code class="docutils literal notranslate"><span class="pre">"__nan__"</span></code> is passed to a low-level sorting or
partitioning algorithm.</p>
</section>
<section id="other-sentinels">
<h4>Other Sentinels<a class="headerlink" href="#other-sentinels" title="Link to this heading">#</a></h4>
<p>Any other python object will raise errors in operations or comparisons, just as
<code class="docutils literal notranslate"><span class="pre">None</span></code> does as a missing data sentinel for object arrays currently:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">dt</span> <span class="o">=</span> <span class="n">StringDType</span><span class="p">(</span><span class="n">na_object</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">"hello"</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dt</span><span class="p">))</span>
<span class="go">ValueError: Cannot compare null that is not a string or NaN-like value</span>
</pre></div>
</div>
<p>Since comparisons need to raise an error, and the NumPy comparison API has no
way to signal value-based errors during a sort without holding the GIL, sorting
arrays that use arbitrary missing data sentinels will hold the GIL. In the future, we may
attempt to relax this restriction by refactoring NumPy’s comparison and sorting
implementation to allow value-based error propagation during a sort operation.</p>
</section>
<section id="implications-for-dtype-inference">
<h4>Implications for DType Inference<a class="headerlink" href="#implications-for-dtype-inference" title="Link to this heading">#</a></h4>
<p>If, in the future, we decide to break backward compatibility to make
<code class="docutils literal notranslate"><span class="pre">StringDType</span></code> the default DType for <code class="docutils literal notranslate"><span class="pre">str</span></code> data, the support for arbitrary
objects as missing data sentinels may seem to pose a problem for implementing
DType inference. However, given that initial support for this DType will require
using the DType directly and will not be able to rely on NumPy to infer the
DType, we do not think this will be a major problem for downstream users of the
missing data feature. To use <code class="docutils literal notranslate"><span class="pre">StringDType</span></code>, they will need to update
their code to explicitly specify a DType when an array is created, so if NumPy
changes DType inference in the future, their code will not change behavior and
there will never be a need for missing data sentinels to participate in DType
inference.</p>
</section>
</section>
<section id="coercing-non-strings">
<h3>Coercing non-strings<a class="headerlink" href="#coercing-non-strings" title="Link to this heading">#</a></h3>
<p>By default, non-string data are coerced to strings:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="nb">object</span><span class="p">(),</span> <span class="mf">3.4</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">StringDType</span><span class="p">())</span>
<span class="go">array(['1', '<object object at 0x7faa2497dde0>', '3.4'], dtype=StringDType())</span>
</pre></div>
</div>
<p>If this behavior is not desired, an instance of the DType can be created that
disables string coercion:</p>
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="nb">object</span><span class="p">(),</span> <span class="mf">3.4</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">StringDType</span><span class="p">(</span><span class="n">coerce</span><span class="o">=</span><span class="kc">False</span><span class="p">))</span>
<span class="gt">Traceback (most recent call last):</span>
File <span class="nb">"<stdin>"</span>, line <span class="m">1</span>, in <span class="n"><module></span>
<span class="gr">ValueError</span>: <span class="n">StringDType only allows string data when string coercion</span>
<span class="x">is disabled</span>
</pre></div>
</div>
<p>This allows strict data validation in the same pass over the data that NumPy uses to
create the array, so downstream libraries do not need to implement their own
string validation in a separate, expensive pass over the input array-like. We
have chosen not to make this the default behavior, for consistency with NumPy
fixed-width strings, which coerce non-strings.</p>
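<p>The single-pass idea can be sketched in plain Python. The helper <code class="docutils literal notranslate"><span class="pre">to_strings</span></code> below is hypothetical and is not part of NumPy; it only models how conversion and validation can share one loop over the input.</p>

```python
def to_strings(items, coerce=True):
    """Convert items to a list of str in one pass (hypothetical helper)."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif coerce:
            # Default behavior: non-strings are converted with str().
            out.append(str(item))
        else:
            # Strict mode: reject the first non-string encountered.
            raise ValueError(
                "StringDType only allows string data when string coercion "
                "is disabled"
            )
    return out


coerced = to_strings([1, 2.5, "a"])
print(coerced)  # ['1', '2.5', 'a']

try:
    to_strings([1, 2.5, "a"], coerce=False)
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)  # True
```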
</section>
<section id="casts-ufunc-support-and-string-manipulation-functions">
<h3>Casts, ufunc support, and string manipulation functions<a class="headerlink" href="#casts-ufunc-support-and-string-manipulation-functions" title="Link to this heading">#</a></h3>
<p>A full set of round-trip casts to the builtin NumPy DTypes will be available. In
addition, we will add implementations for the comparison operators as well as an
<code class="docutils literal notranslate"><span class="pre">add</span></code> loop that accepts two string arrays, <code class="docutils literal notranslate"><span class="pre">multiply</span></code> loops that accept
string and integer arrays, an <code class="docutils literal notranslate"><span class="pre">isnan</span></code> loop, and implementations for the
<code class="docutils literal notranslate"><span class="pre">str_len</span></code>, <code class="docutils literal notranslate"><span class="pre">isalpha</span></code>, <code class="docutils literal notranslate"><span class="pre">isdecimal</span></code>, <code class="docutils literal notranslate"><span class="pre">isdigit</span></code>, <code class="docutils literal notranslate"><span class="pre">isnumeric</span></code>,
<code class="docutils literal notranslate"><span class="pre">isspace</span></code>, <code class="docutils literal notranslate"><span class="pre">find</span></code>, <code class="docutils literal notranslate"><span class="pre">rfind</span></code>, <code class="docutils literal notranslate"><span class="pre">count</span></code>, <code class="docutils literal notranslate"><span class="pre">strip</span></code>, <code class="docutils literal notranslate"><span class="pre">lstrip</span></code>, <code class="docutils literal notranslate"><span class="pre">rstrip</span></code>,
and <code class="docutils literal notranslate"><span class="pre">replace</span></code> string ufuncs that will be newly available in NumPy 2.0.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">isnan</span></code> ufunc will return <code class="docutils literal notranslate"><span class="pre">True</span></code> for entries that are NaN-like sentinels
and <code class="docutils literal notranslate"><span class="pre">False</span></code> otherwise. Comparisons will sort data in order of unicode code
point, as is currently implemented for the fixed-width unicode DType. In the
future, NumPy or a downstream library may add locale-aware sorting, case folding,
and normalization for NumPy unicode string arrays, but we are not proposing
these features at this time.</p>
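<p>Python's built-in <code class="docutils literal notranslate"><span class="pre">str</span></code> comparison also orders by code point, so it can serve as a small model of this ordering. The sample words are assumptions chosen to show that the result is neither case-insensitive nor locale-aware.</p>

```python
# Code-point ordering, modelled with Python's built-in str comparison.
words = ["banana", "Zebra", "\u00e9clair", "apple"]

ordered = sorted(words)
print(ordered)  # ['Zebra', 'apple', 'banana', 'éclair']

# The order follows the code point of the first differing character.
first_code_points = [ord(w[0]) for w in ordered]
print(first_code_points)  # [90, 97, 98, 233]
```

<p>An uppercase "Z" sorts before every lowercase ASCII letter, and the accented "é" sorts after all of them, which is the kind of result locale-aware sorting would change.</p>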
<p>Two <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> instances are considered equal if they are created with the
same <code class="docutils literal notranslate"><span class="pre">na_object</span></code> and <code class="docutils literal notranslate"><span class="pre">coerce</span></code> parameters. For ufuncs that accept more than
one string argument we also introduce the concept of “compatible”
<code class="docutils literal notranslate"><span class="pre">StringDType</span></code> instances. We allow distinct DType instances to be used in ufunc
operations together if they have the same <code class="docutils literal notranslate"><span class="pre">na_object</span></code> or if only one
of the DTypes has an <code class="docutils literal notranslate"><span class="pre">na_object</span></code> explicitly set. We do not consider
string coercion for determining whether instances are compatible, although if
the result of the operation is a string, the result will inherit the stricter
string coercion setting of the original operands.</p>
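<p>The compatibility rule can be written down as a small plain-Python model. Everything here is hypothetical and is not NumPy's implementation: <code class="docutils literal notranslate"><span class="pre">StringDTypeModel</span></code>, <code class="docutils literal notranslate"><span class="pre">compatible</span></code>, and <code class="docutils literal notranslate"><span class="pre">result_dtype</span></code> are illustrative names, sentinels are assumed to match by identity or equality, and the exception type for incompatible operands is an arbitrary choice.</p>

```python
from dataclasses import dataclass

_UNSET = object()  # stand-in for "no na_object explicitly set"


@dataclass(frozen=True)
class StringDTypeModel:
    """Hypothetical stand-in for a StringDType instance."""
    na_object: object = _UNSET
    coerce: bool = True


def has_na(dt):
    return dt.na_object is not _UNSET


def compatible(a, b):
    """Compatible unless both set an na_object and the two differ."""
    if has_na(a) and has_na(b):
        # Assumption: "same" means identical or equal sentinels.
        return a.na_object is b.na_object or a.na_object == b.na_object
    return True


def result_dtype(a, b):
    """Model of the DType of a string-valued result."""
    if not compatible(a, b):
        # The exception type is an assumption of this sketch.
        raise TypeError("incompatible na_object settings")
    na = a.na_object if has_na(a) else b.na_object
    # The stricter coercion setting wins: coerce=False if either operand has it.
    return StringDTypeModel(na_object=na, coerce=a.coerce and b.coerce)


default = StringDTypeModel()
with_none = StringDTypeModel(na_object=None)
strict = StringDTypeModel(coerce=False)

print(compatible(default, with_none))      # True
print(result_dtype(with_none, default))    # keeps na_object=None
print(result_dtype(strict, default).coerce)  # False
```

<p>In this model, coercion never affects compatibility; it only affects the setting carried by the result.</p>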
<p>This notion of “compatible” instances will be enforced in the
<code class="docutils literal notranslate"><span class="pre">resolve_descriptors</span></code> function of binary ufuncs. This choice makes it easier
to work with non-default <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> instances, because Python strings are
coerced to the default <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> instance, so the following idiomatic
expression is allowed:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s2">"hello"</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">StringDType</span><span class="p">(</span><span class="n">na_object</span><span class="o">=</span><span class="kc">None</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">arr</span> <span class="o">+</span> <span class="s2">"!"</span>
<span class="go">array(['hello!', 'world!'], dtype=StringDType(na_object=None))</span>
</pre></div>
</div>
<p>If we only considered equality of <code class="docutils literal notranslate"><span class="pre">StringDType</span></code> instances, this would
be an error, making for an awkward user experience. If the operands have