|
| 1 | +--- |
| 2 | +site: |
| 3 | + hide_toc: true |
| 4 | + hide_footer_links: true |
| 5 | +--- |
| 6 | + |
| 7 | +# 2024 Landscape Analysis |
| 8 | + |
| 9 | +Python is widely adopted in data science, and its use for statistics is expanding rapidly---particularly in education and applied research. |
| 10 | +The statistical ecosystem in Python is currently anchored by six major libraries: |
| 11 | + |
| 12 | +- [numpy](https://numpy.org/), which provides fast, flexible array and numerical operations, and underpins nearly all statistical and scientific computing in Python. |
| 13 | + It supports descriptive statistics, correlation and covariance computations, random sampling, and tools for constructing histograms and binning data. |
| 14 | +- [pandas](https://pandas.pydata.org/), which offers intuitive, high-performance data structures for tabular and time series data, making data cleaning, wrangling, reshaping, aggregation, and exploratory analysis straightforward and efficient. |
| 15 | +- [scipy](https://scipy.org/), which builds on NumPy to deliver a broad range of scientific and statistical functionality---including, in its [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html) submodule, a comprehensive suite of probability distributions, summary statistics, and basic statistical tests. |
| 16 | + It also provides modules for clustering, optimization, interpolation, and signal processing. |
| 17 | +- [matplotlib](https://matplotlib.org/), the foundational plotting library in Python, which enables the creation of high-quality static, animated, and interactive visualizations, and serves as the basis for many higher-level plotting and statistical graphics libraries. |
| 18 | +- [statsmodels](https://www.statsmodels.org/), which offers tools for econometrics, classical statistics, and statistical modeling---including linear and generalized linear models, time series analysis, survival analysis, and hypothesis testing, with extensive support for model diagnostics and statistical inference. |
| 19 | +- [scikit-learn](https://scikit-learn.org/), which is best known for machine learning but also supports statistical modeling, offering a consistent API for regression, classification, clustering, model evaluation, statistical preprocessing, and dimensionality reduction. |
| 20 | + |
| 21 | +These core libraries are generally well-tested, reliable, and uphold high software engineering standards, making them trusted foundations for research and application. |
| 22 | +They benefit from contributions not only from science users but also from methods and software developers. |
| 23 | +Libraries like scikit-learn are especially valued for their clean, consistent interfaces and their integration with the broader Python data stack, which streamlines workflows and enhances usability for both new and experienced users. |
| 24 | + |
| 25 | +While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries. |
| 26 | +This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools. |
| 27 | +As Python's role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners. |
| 28 | +This will also require increased participation from statistics methods developers in the core packages. |
| 29 | + |
| 30 | +# Relationship to Other Languages |
| 31 | + |
| 32 | +R remains the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. |
| 33 | +R's [tidyverse](https://www.tidyverse.org/) and [RStudio](https://posit.co/products/open-source/rstudio/) provide a smoother user experience for statistics, and CRAN offers a vast repository of statistical packages. |
| 34 | +The R ecosystem also benefits from substantial contributions from statistics methods developers. |
| 35 | + |
| 36 | +:::{table} Python vs. R for Statistics |
| 37 | +:label: table |
| 38 | +:align: center |
| 39 | + |
| 40 | +| Aspect | Python (Scientific Python) | R (CRAN, tidyverse) | |
| 41 | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | |
| 42 | +| Core Libraries | [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), [statsmodels](https://www.statsmodels.org/), [scikit-learn](https://scikit-learn.org/) | [base R](https://www.r-project.org/), [tidyverse](https://www.tidyverse.org/), many CRAN packages | |
| 43 | +| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio | |
| 44 | +| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly | |
| 45 | +| Community | Large, but less connected in statistics | Strong, statistics-focused, welcoming | |
| 46 | +| Package Development | High barriers, less modularity | Easy, many small packages, dev tools | |
| 47 | +| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio | |
| 48 | +| Branding | Data science/machine learning focus | Statistics-focused | |
| 49 | + |
| 50 | +::: |
| 51 | + |
| 52 | +**Interoperability**: While some users switch between Python and R in their workflows, true interoperability is limited. |
| 53 | +Most projects use one language at a time, though it is common to leverage R for data manipulation and Python for modeling, or vice versa. |
| 54 | + |
| 55 | +**Other Platforms**: Tools like GraphPad Prism remain popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates all applied domains. |
| 56 | + |
| 57 | +# Weaknesses and Needs |
| 58 | + |
| 59 | +Despite Python's strengths, several challenges remain. |
| 60 | + |
| 61 | +- **Fragmentation**: The ecosystem is fragmented, with major libraries (e.g., statsmodels vs. scikit-learn) adopting incompatible APIs and workflows, leading to confusion for users and students. |
| 62 | +- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's [tidyverse](https://www.tidyverse.org/) or RStudio, making it harder for newcomers to get started. |
| 63 | +- **Interoperability**: Data structures (such as those from [pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/)) do not always work seamlessly across libraries, requiring conversions and leading to unpredictable function outputs compared to R's tidyverse pipelines. |
| 64 | + Moreover, some statistical methods use the results of other statistical subroutines (e.g., a multiple testing adjustment might be applied to the results of a number of different tests). |
| 65 | + At the moment, there is limited support for composing statistical methods as subroutines in this way. |
| 66 | +- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community. |
| 67 | +- **Contributor Barriers**: Contributing to core libraries can be difficult due to high standards and lack of modularity. |
| 68 | + Small, specialized packages exist but are less visible and less widely used than in R. |
| 69 | +- **Statistical Methods Coverage**: Support for basic methods could be improved; moreover, Python's support for advanced or niche statistical methods generally lags behind what is available in R's vast [CRAN](https://cran.r-project.org/) repository. |
| 70 | +- **Comprehensive tooling for statistical analysis**: Data analysts using statistical methods need more than just the `p`-value of a statistical test or the coefficients of a regression model. |
| 71 | + There are well-established numerical and visual diagnostics that accompany many statistical methods, but these typically have limited support in existing packages. |
| 72 | + Moreover, analysts need to communicate their results through a variety of media, and Python statistical software often has minimal built-in support for doing so. |
| 73 | +- **Abstracting the core computation from the statistical methodology**: Many computations required in statistics (e.g., solving the optimization problem associated with a generalized linear model) have a variety of algorithmic options. |
| 74 | + While most statistical packages implement only one or two algorithms, there is rarely one "right" algorithm for every scenario. |
| 75 | + Depending on the size of the data, available hardware, analysis needs, etc., there can be multiple algorithms an analyst might want to use. |
| 76 | + Many Python statistical software packages tightly couple the core computation with the rest of the methodology, which makes it difficult to provide better computational approaches. |
| 77 | +- **Community and Culture**: The Python statistics community is less cohesive and connected than R's, which benefits from a strong identity and established events. |
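The composition problem raised under **Interoperability** above (feeding the results of several tests into a multiple testing adjustment) can be sketched in a few lines. The sketch is illustrative only and uses just the Python standard library. `StatResult`, `one_sample_z_test`, and `benjamini_hochberg` are hypothetical names, not the API of any existing package, and the Benjamini-Hochberg procedure is picked here merely as one example of an adjustment.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, stdev


@dataclass(frozen=True)
class StatResult:
    """Hypothetical shared result type that every test returns."""
    name: str
    statistic: float
    pvalue: float


def one_sample_z_test(name, sample, mu0=0.0):
    """Two-sided large-sample z-test of H0: mean == mu0 (an illustrative test)."""
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return StatResult(name, z, p)


def benjamini_hochberg(results):
    """Return Benjamini-Hochberg adjusted p-values keyed by test name.

    The adjustment only reads `name` and `pvalue`, so it accepts results
    from any test that returns a StatResult.
    """
    m = len(results)
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    ordered = sorted(results, key=lambda r: r.pvalue, reverse=True)
    adjusted = {}
    running_min = 1.0
    for i, r in enumerate(ordered):
        rank = m - i  # rank of this p-value in ascending order
        running_min = min(running_min, r.pvalue * m / rank)
        adjusted[r.name] = running_min
    return adjusted


# Usage: run several tests, then adjust all of their p-values together.
samples = {
    "group_a": [0.2, 0.9, 1.1, 0.4, 0.8, 1.3],
    "group_b": [-0.1, 0.3, -0.4, 0.2, 0.1, -0.2],
}
results = [one_sample_z_test(name, data) for name, data in samples.items()]
adjusted_pvalues = benjamini_hochberg(results)
```

Because the adjustment depends only on the shared result type, any test that returns it can be combined with any adjustment; agreeing on such a type across packages is one possible route to the composability described above.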
| 78 | + |
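The point about separating the core computation from the statistical methodology can also be made concrete. The following is a minimal sketch, assuming a one-parameter logistic regression (slope only, no intercept) as a stand-in for a generalized linear model. `LogisticSlopeModel`, `newton`, and `gradient_ascent` are hypothetical names introduced for illustration; no existing package is claimed to be organized this way.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


class LogisticSlopeModel:
    """Methodology only: defines the likelihood, not how it is maximized."""

    def __init__(self, x, y):
        self.x, self.y = list(x), list(y)

    def score(self, b):
        # First derivative of the log-likelihood with respect to the slope b.
        return sum(xi * (yi - sigmoid(b * xi)) for xi, yi in zip(self.x, self.y))

    def hessian(self, b):
        # Second derivative of the log-likelihood (always <= 0 here).
        return -sum(
            xi * xi * sigmoid(b * xi) * (1.0 - sigmoid(b * xi)) for xi in self.x
        )

    def fit(self, solver, start=0.0):
        # The solver is injected, so the algorithm can change independently.
        return solver(self.score, self.hessian, start)


def newton(score, hessian, b, tol=1e-10, max_iter=100):
    """Newton-Raphson: few iterations, but needs second derivatives."""
    for _ in range(max_iter):
        g = score(b)
        if abs(g) < tol:
            break
        b -= g / hessian(b)
    return b


def gradient_ascent(score, hessian, b, step=0.1, tol=1e-10, max_iter=10_000):
    """Fixed-step gradient ascent: ignores the Hessian, takes more iterations."""
    for _ in range(max_iter):
        g = score(b)
        if abs(g) < tol:
            break
        b += step * g
    return b


# Usage: the same model, fitted with two interchangeable algorithms.
model = LogisticSlopeModel(x=[-2, -1, 0, 1, 2, 3], y=[0, 0, 1, 0, 1, 1])
b_newton = model.fit(newton)
b_gradient = model.fit(gradient_ascent)
```

Both solvers target the same maximum likelihood estimate. Which one is preferable depends on data size, hardware, and analysis needs, which is the choice a tightly coupled implementation takes away from the analyst.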
| 79 | +# Conclusion |
| 80 | + |
| 81 | +Python's statistics ecosystem is powerful but fragmented, with significant opportunities for improvement in usability, interoperability, teaching resources, and community cohesion. |
| 82 | +While R remains the default for statistics, Python is gaining ground, especially as data science and machine learning continue to grow in influence. |
| 83 | +Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain. |
| 84 | +In particular, Python needs: |
| 85 | + |
| 86 | +- A unified, user-friendly interface for statistics, possibly modeled after scikit-learn. |
| 87 | +- Improved interoperability between core data structures and libraries. |
| 88 | +- More accessible teaching resources and case studies focused on statistics. |
| 89 | +- Lower barriers for contributors and greater visibility for specialized statistical packages. |
| 90 | +- Stronger community identity and central organization for statistics in Python. |
| 91 | + |
| 92 | +The Statistical Python project seeks to address these needs by fostering collaboration, sharing best practices, and building a sustainable, open community. |
| 93 | +As a domain stack within the [Scientific Python project](https://scientific-python.org/), and with support from the NSF POSE Phase I grant, we are committed to making Python a premier platform for statistical computing, education, and research. |