Ophel.ia

This API builds data mining and ML pipelines with PySpark.


Ophelia, Hamlet’s beloved (and the namesake of this package), is known for her madness and her undying love for Hamlet, but Shakespeare’s masterpiece does not do justice to her magnificent character. Ophelia is the epitome of goodness, brightness, and the elegance of simplicity.

Motivations 🚀

As data scientists or data analysts, we don’t want to waste time guessing how PySpark’s framework should be used. Sometimes we just want a prompt answer rather than a full, polished piece of code. With that in mind, this project aims to reduce the complexity of the analytical lifecycle for everyone who uses PySpark frequently.

Now is the time for a new, smart, and very extravagant Ophelia to help us flatten the learning curve of PySpark’s most common functionality, offering features such as:

Getting Started:

Requirements 📜

Before starting, you’ll need to have installed pyspark >= 3.0.x, pandas >= 1.1.3, numpy >= 1.19.1, dask >= 2.30.x, and scikit-learn >= 0.23.x. Additionally, to use the Ophelia API you’ll need Python (versions 3.7 and 3.8 are supported) and pip installed.
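
As a quick sanity check, the minimum versions listed above could be compared against what is installed. The snippet below is only a sketch under a few assumptions: the helper names (parse_version, meets_minimum, check_requirements) are hypothetical and not part of Ophelia, the ".x" in a requirement is treated as 0, and the lookup relies on importlib.metadata, which ships with Python 3.8 and later (on 3.7 every package is reported as "unknown").

```python
import re

try:  # importlib.metadata is in the standard library from Python 3.8 onward
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7: no stdlib metadata module, skip the lookup
    version = None
    PackageNotFoundError = Exception

# Minimum versions copied from the requirements above (".x" treated as 0).
MINIMUMS = {
    "pyspark": "3.0",
    "pandas": "1.1.3",
    "numpy": "1.19.1",
    "dask": "2.30",
    "scikit-learn": "0.23",
}


def parse_version(text):
    """Turn '1.19.1' or '3.0.0.dev0' into a 3-tuple of ints such as (1, 19, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("not a numeric version: %r" % (text,))
    parts = [int(p) for p in match.group(0).split(".")][:3]
    # Pad short versions so that '3.0' compares equal to '3.0.0'.
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old', 'missing' or 'unknown'."""
    report = {}
    for name, minimum in minimums.items():
        if version is None:
            report[name] = "unknown"
            continue
        try:
            ok = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = "missing"
        except (TypeError, ValueError):  # unreadable version metadata
            report[name] = "unknown"
        else:
            report[name] = "ok" if ok else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(package, status)
```

The comparison is deliberately simple: it only looks at the first three numeric components and ignores pre-release tags, which is enough for the lower bounds listed here.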

Building from source 🛠️

Just clone the Ophelia repo and import Ophelia:

git clone https://github.com/LuisFalva/ophilea.git

To initialize Ophelia with an embedded Spark session, use:

>>> from ophelia.start import Ophelia
>>> ophelia = Ophelia("Set Your Own Spark App Name")
>>> sc = ophelia.Spark.build_spark_context()

13:17:48.840 Ophelia [TAPE] +---------------------------------------------------------------+
13:17:48.840 Ophelia [INFO] | Hello! This API builds data mining & ml pipelines with pyspark|
13:17:48.840 Ophelia [INFO] | Welcome to Ophelia pyspark miner engine                       |
13:17:48.840 Ophelia [INFO] | Lib Version ophelia.0.1.dev0                                  |
13:17:48.840 Ophelia [TAPE] +---------------------------------------------------------------+
13:17:48.840 Ophelia [WARN]                      - Ophilea Gentleman Org -            
13:17:48.840 Ophelia [MASK]   █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █
13:17:48.840 Ophelia [MASK]   █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █
13:17:48.841 Ophelia [MASK]   █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ █ █ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ █ █ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ ╬ ╬ ╬ ╬ ╬ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ █ █ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ █ █ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ▓ ▓ ▓ ▓ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ▓ ▓ ▓ ▓ ╬ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ▓ ▓ ▓ ▓ ▓ ▓ ╬ ╬ █ ╬ ╬ ╬ █ ╬ ╬ ╬ █ ╬ ╬ ▓ ▓ ▓ ▓ ▓ ▓ ╬ ╬ █
13:17:48.841 Ophelia [MASK]   █ ╬ ╬ ╬ ▓ ▓ ▓ ▓ ╬ ╬ █ █ ╬ ╬ ╬ █ ╬ ╬ ╬ █ █ ╬ ╬ ▓ ▓ ▓ ▓ ╬ ╬ ╬ █
13:17:48.842 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.842 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ █ █ █ █ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ █ █ █ █ ╬ ╬ ╬ ╬ ╬ █
13:17:48.842 Ophelia [MASK]   █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █
13:17:48.842 Ophelia [MASK]   █ █ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ █ █
13:17:48.842 Ophelia [MASK]   █ █ ╬ ╬ █ █ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ █ █ ╬ ╬ █ █
13:17:48.842 Ophelia [MASK]   █ █ ╬ ╬ ▓ █ █ █ ╬ ╬ ╬ █ █ █ █ ╬ █ █ █ █ ╬ ╬ ╬ █ █ █ ▓ ╬ ╬ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ ╬ ╬ ▓ ▓ █ █ █ █ █ █ █ ╬ ╬ ╬ █ █ █ █ █ █ █ ▓ ▓ ╬ ╬ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ ╬ ╬ ╬ ╬ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ╬ ╬ ╬ ╬ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ ╬ █ █ █ ╬ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ █ █ █ █ █ ╬ ╬ ╬ ╬ █ ╬ ╬ ╬ ╬ █ █ █ █ █ █ █ █ █ █ █
13:17:48.842 Ophelia [MASK]   █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █
                                                              
13:17:48.843 Ophelia [WARN] Initializing Spark Session
13:17:58.062 Ophelia [INFO] Spark Version: 3.0.0
13:17:58.063 Ophelia [INFO] This Is: 'Set Your Own Spark App Name' App
13:17:58.063 Ophelia [INFO] Spark Context Initialized Success

Main class objects provided by initializing an Ophelia session:

Let me show you some application examples:

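The Read class implements a Spark reader for multiple formats: {'csv', 'parquet', 'excel', 'json'}. In the example below, spark is assumed to be an active Spark session and path the location of the input file.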

>>> from ophelia.read.spark_read import Read
>>> spark_df = spark.readFile(path, 'csv', header=True, infer_schema=True)

You may also import the Shape class from the functions module to get the dimensions of a Spark DataFrame, NumPy style.

>>> import pandas as pd
>>> from ophelia.functions import Shape
>>> dic = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Year': [2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012],
    'Revenue': [100, 200, 300, 110, 190, 320, 120, 220, 350]
}
>>> dic_to_df = spark.createDataFrame(pd.DataFrame(data=dic))
>>> dic_to_df.show(10, False)

+-------+----+-------+
|Product|Year|Revenue|
+-------+----+-------+
|A      |2010|100    |
|B      |2010|200    |
|C      |2010|300    |
|A      |2011|110    |
|B      |2011|190    |
|C      |2011|320    |
|A      |2012|120    |
|B      |2012|220    |
|C      |2012|350    |
+-------+----+-------+

>>> dic_to_df.Shape
(9, 3)
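
To make the meaning of that tuple concrete, here is a minimal pure-Python sketch of the same (rows, columns) convention applied to the dictionary above. The shape helper is hypothetical and only illustrates the idea; the source does not show how Ophelia computes Shape internally. On a plain Spark DataFrame the same pair can be obtained with (df.count(), len(df.columns)).

```python
def shape(columns):
    """Return (n_rows, n_columns) for a dict that maps column names to lists."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    n_rows = lengths.pop() if lengths else 0
    return (n_rows, len(columns))


dic = {
    "Product": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "Year": [2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012],
    "Revenue": [100, 200, 300, 110, 190, 320, 120, 220, 350],
}

print(shape(dic))  # (9, 3)
```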

The pct_change wrapper (exposed as pctChange) is added to the Spark DataFrame class to provide one of the most commonly used pandas methods: it computes the relative percentage change from one observation to the next, with rows sorted by a date-type column and the lag applied to a numeric-type column.

>>> from ophelia.functions import PctChange
>>> dic_to_df.pctChange().show(10, False)

+-------------------+
|Revenue            |
+-------------------+
|null               |
|1.0                |
|0.5                |
|-0.6333333333333333|
|0.7272727272727273 |
|0.6842105263157894 |
|-0.625             |
|0.8333333333333333 |
|0.5909090909090908 |
+-------------------+

You can also set the function’s parameters explicitly.

In this case, we specify only the periods parameter, so each value is compared with the observation two rows earlier.

>>> dic_to_df.pctChange(periods=2).na.fill(0).show(5, False)

+--------------------+
|Revenue             |
+--------------------+
|0.0                 |
|0.0                 |
|2.0                 |
|-0.44999999999999996|
|-0.3666666666666667 |
+--------------------+
only showing top 5 rows
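
The arithmetic behind the two outputs above can be reproduced with a few lines of plain Python. This is a sketch of the calculation only, not Ophelia's implementation: the pct_change function below is a hypothetical stand-in that compares each value with the one periods rows earlier and returns None where no earlier row exists.

```python
def pct_change(values, periods=1):
    """Relative change of each value against the one `periods` rows earlier."""
    if periods < 1:
        raise ValueError("periods must be a positive integer")
    changes = []
    for i, current in enumerate(values):
        if i < periods or values[i - periods] == 0:
            # No earlier row to compare with (or a zero base): undefined.
            changes.append(None)
        else:
            changes.append(current / values[i - periods] - 1)
    return changes


revenue = [100, 200, 300, 110, 190, 320, 120, 220, 350]

one_step = pct_change(revenue)             # matches the default call
two_step = pct_change(revenue, periods=2)  # matches periods=2

# Mimic .na.fill(0) and round for display.
print([0.0 if c is None else round(c, 4) for c in two_step[:5]])
# [0.0, 0.0, 2.0, -0.45, -0.3667]
```

For example, the third row with periods=2 is 300 / 100 - 1 = 2.0, which is the value shown in the table.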

Adding parameters: partition_by, order_by & pct_cols

>>> dic_to_df.pctChange(partition_by="Product", order_by="Year", pct_cols="Revenue").na.fill(0).show(5, False)

+---------------------+
|Revenue              |
+---------------------+
|0.0                  |
|-0.050000000000000044|
|0.1578947368421053   |
|0.0                  |
|0.06666666666666665  |
+---------------------+
only showing top 5 rows
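
The partitioned variant can be sketched the same way: group the rows by product, sort each group by year, and compute the change inside each group. The function name pct_change_by_group and the tuple layout are assumptions made for illustration and are not part of Ophelia. The five rows shown above appear to correspond to product B followed by the first two rows of product C.

```python
from collections import defaultdict

# (Product, Year, Revenue) rows from the example DataFrame.
rows = [
    ("A", 2010, 100), ("B", 2010, 200), ("C", 2010, 300),
    ("A", 2011, 110), ("B", 2011, 190), ("C", 2011, 320),
    ("A", 2012, 120), ("B", 2012, 220), ("C", 2012, 350),
]


def pct_change_by_group(rows, periods=1):
    """Percentage change of the value within each key, ordered by the sort field."""
    groups = defaultdict(list)
    for key, order, value in rows:
        groups[key].append((order, value))

    result = {}
    for key, series in groups.items():
        series.sort()  # plays the role of order_by="Year"
        values = [value for _, value in series]
        result[key] = [
            None if i < periods else values[i] / values[i - periods] - 1
            for i in range(len(values))
        ]
    return result


changes = pct_change_by_group(rows)
print([None if c is None else round(c, 4) for c in changes["B"]])
# [None, -0.05, 0.1579]
```

Because the lag never crosses a group boundary, the first row of every product has no predecessor, which is why each group starts with a null (filled with 0 in the example above).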

You may also lag more than one column at a time by simply adding a list with string column names:

>>> dic_to_df.pctChange(partition_by="Product", order_by="Year", pct_cols=["Year", "Revenue"]).na.fill(0).show(5, False)

+--------------------+---------------------+
|Year                |Revenue              |
+--------------------+---------------------+
|0.0                 |0.0                  |
|4.975124378110429E-4|-0.050000000000000044|
|4.972650422674363E-4|0.1578947368421053   |
|0.0                 |0.0                  |
|4.975124378110429E-4|0.06666666666666665  |
+--------------------+---------------------+
only showing top 5 rows

Want to contribute? 🤔

Bring it on! If you have an idea, a question, or a bug you want fixed, you may open an issue ticket, where you will find the guidelines for making an issue request. You can also get a glimpse of the Open Source Contribution Guide best practices. Cheers 🍻!

Support or Contact

Having trouble with Ophelia? You can DM me at falvaluis@gmail.com and I’ll help you sort it out.

License 📃

Released under the Apache License, version 2.0.