library(reticulate)
Using Python and R Together
As the demands of data science keep growing, Python has become one of the fundamental tools a data scientist should know. Even though R is elegant and sufficient for many data-related applications, some Python may still be required to get the job done. One reason is API connectivity: Python enjoys first-class support (e.g. a native SDK) in many services. For instance, the boto3 package for AWS was quite useful (before the paws R package existed).
RStudio positions itself as “A Single Home for R and Python Data Science” in their blog post. That is also the main reason why they created and support the reticulate R package. Before reticulate, Python integration was possible but much more difficult in many respects. It may still have quirks, but reticulate provides much better integration.
In this tutorial, we are going to work through simple examples using the reticulate package. This tutorial is not exhaustive. For better coverage, check the official package page.
Initialization
It always starts with loading the package.
library(reticulate)
Python has multiple versions (thankfully Python 2 has been phased out, but versions are still a pain on many operating systems). You can find the path of the current python executable using the Sys.which function. The output will differ across systems and installations.
Sys.which("python")
python
"/Users/rocket/.pyenv/shims/python"
For more detail, py_config is a good function. The output will differ across systems and installations.
py_config()
python: /Users/rocket/.virtualenvs/r-reticulate/bin/python
libpython: /Users/rocket/.pyenv/versions/3.11.6/lib/libpython3.11.dylib
pythonhome: /Users/rocket/.virtualenvs/r-reticulate:/Users/rocket/.virtualenvs/r-reticulate
version: 3.11.6 (main, Nov 23 2023, 13:24:35) [Clang 14.0.0 (clang-1400.0.29.202)]
numpy: /Users/rocket/.virtualenvs/r-reticulate/lib/python3.11/site-packages/numpy
numpy_version: 1.26.2
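To cross-check from the Python side which interpreter was actually picked up, a short snippet like the one below can be run in a Python chunk. This is only a minimal sketch using the standard library; the helper name `describe_interpreter` is hypothetical and not part of reticulate, and the values may differ slightly when Python is embedded in an R session.

```python
import platform
import sys


def describe_interpreter():
    """Return basic facts about the running Python interpreter."""
    return {
        "executable": sys.executable,          # path of the interpreter binary
        "version": platform.python_version(),  # version string such as "3.11.6"
        "prefix": sys.prefix,                  # root of the active environment
    }


# Print one "key: value" line per entry, similar in spirit to py_config().
for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If the reported prefix does not match the environment you expected, the interpreter was probably chosen before your preferred one was requested.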
To see the available configurations, py_discover_config can be used.
py_discover_config()
reticulate can use other Python installations, virtual environments and conda environments. Although this is a great convenience, the intricacies and fragility of Python versions might still hurt your workflows.
use_python() ## python path
use_virtualenv() ## virtual environment name
use_condaenv() ## conda environment
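After selecting an environment, it can help to confirm from Python which kind of environment is active. The sketch below is a heuristic, not an official check: the function name `environment_kind` is hypothetical, and it assumes the usual conventions that a virtual environment changes `sys.prefix` relative to `sys.base_prefix` and that a conda environment contains a `conda-meta` directory.

```python
import os
import sys


def environment_kind():
    """Guess whether Python runs in a virtualenv, a conda env, or neither."""
    # venv/virtualenv point sys.prefix at the environment, while
    # sys.base_prefix keeps pointing at the base installation.
    if sys.prefix != getattr(sys, "base_prefix", sys.prefix):
        return "virtualenv"
    # conda keeps a conda-meta directory at the root of each environment.
    if os.path.isdir(os.path.join(sys.prefix, "conda-meta")):
        return "conda"
    return "system"


print(environment_kind(), sys.prefix)
```

Because this is only a guess based on conventions, treat the output of py_config() as the authoritative answer when the two disagree.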
Methods and Examples
In this section we will see some ways of using Python through reticulate.
Call Python functions R style
This is the fundamental and (in my opinion) worst way to benefit from Python in R, because the translation of style is imperfect and might not work in every case.
Here is an example with pandas. Install pandas if it is not already installed.
py_install("pandas")
## similar to import pandas as pd
pd <- import("pandas")
## create a simple dataframe
df_pandas <- pd$DataFrame(data = list(col1 = c(2, 1, 3), col2 = c("a", "b", "c")))
df_pandas
col1 col2
1 2 a
2 1 b
3 3 c
Let’s try a simple example.
os <- import("os")
os$path$join("a", "b", "c")
[1] "a/b/c"
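Caveats: Some functions that work on the console might not work in RMarkdown. One likely cause of the errors below is that reticulate by default converts a pandas DataFrame into an R data.frame, which has no Python methods such as sort_values or query.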
try(df_pandas$sort_values("col1"))
Error in try(df_pandas$sort_values("col1")) :
attempt to apply non-function
## filter using query fails both on console and rmarkdown
try(df_pandas$query("col1 > 1.0"))
Error in try(df_pandas$query("col1 > 1.0")) :
attempt to apply non-function
Call Python in RMarkdown chunk
The format is similar to an R chunk, but instead of r write python.
```{python}
#your python code here
```
Here is an example
import os
os.path.join("a", "b", "c")
'a/b/c'
Our pandas examples also work in this case.
import pandas as pd
pandas_df_py = pd.DataFrame(data={'col1': [2, 1, 3], 'col2': ['a', 'b', 'c']})
pandas_df_py
col1 col2
0 2 a
1 1 b
2 3 c
pandas_df_py.sort_values('col1')
col1 col2
1 1 b
0 2 a
2 3 c
## filter using query works in a Python chunk
pandas_df_py.query('col1 > 1.0')
col1 col2
0 2 a
2 3 c
Source Python Script
Personally, I find sourcing a .py file the most convenient way to use Python code. Even though interoperability is a great idea, keeping Python and R code as separate as possible will save you a lot of headaches in the future (at least with the current implementation).
We can easily source any Python file using the source_python() function. Let’s name our Python file triangle.py and write the following.
def area_of_triangle(h,x):
    return h*x/2
area_of_triangle(3,5)
7.5
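Since a sourced function will be called from R with whatever values R hands over, it can be worth adding a docstring and a small sanity check. The variant below is only an illustrative sketch of how triangle.py might be hardened; the non-negativity check is an assumption added here and is not part of the original file.

```python
def area_of_triangle(h, x):
    """Return the area of a triangle with height h and base x."""
    # Assumed validation (not in the original file): reject negative lengths.
    if h < 0 or x < 0:
        raise ValueError("height and base must be non-negative")
    return h * x / 2


print(area_of_triangle(3, 5))  # prints 7.5
```

When such a function is called through reticulate, a raised Python exception should surface as an R error, so the check remains visible on the R side.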
Then source the file as follows.
reticulate::source_python("triangle.py") ### Remember to provide proper relative or absolute path.
Note: For RMarkdown demonstration purposes below, we:
- created a temporary file
- wrote our python function to that file using cat
- then executed source_python
- then called the function as an R function
pyfile <- tempfile(fileext = ".py")
cat("def area_of_triangle(h,x):
    return h*x/2", file = pyfile)
source_python(pyfile)
area_of_triangle(3, 5)
[1] 7.5
But deep down it is a Python function.
print(area_of_triangle)
<function area_of_triangle at 0x13f314f40>
Conclusion
To be honest, Python versioning is a mess. Lots of parallel versions (even python2 issues are still looming) and conflicts across different layers of computing can be troublesome, especially in Docker settings. Therefore it is highly recommended to add Python code to your R code only if it is absolutely necessary.
Still, if you need to use Python and cannot separate the processes, it is always good to have an excellent tool at hand. reticulate is currently that tool.
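Given the versioning concerns above, one modest safeguard is to let a sourced Python script fail fast when it runs under an unexpected interpreter. The following is a minimal sketch under stated assumptions: the name `check_python_version` and the minimum version `(3, 8)` are hypothetical and should be adapted to what your own code requires.

```python
import sys

MIN_VERSION = (3, 8)  # hypothetical minimum; adjust to what your code needs


def check_python_version(current=None, minimum=MIN_VERSION):
    """Raise RuntimeError if the interpreter is older than `minimum`."""
    # Default to the (major, minor) version of the running interpreter.
    current = tuple(sys.version_info[:2]) if current is None else tuple(current)
    if current < tuple(minimum):
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d"
            % (minimum[0], minimum[1], current[0], current[1])
        )
    return current


# Place the call at the top of a script so problems show up immediately.
check_python_version()
```

Placing such a check at the top of a sourced file turns a confusing downstream failure into a clear message about the interpreter version.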