DataFrame Documentation
This is the official documentation for the DataFrame API provided by Raven Computing. This page unifies the documentation for all supported programming languages and gives examples on how to use the API. The specification strictly defines the behaviour of DataFrames as an in-memory data structure as well as a platform-independent file format for persistence. The objective is to provide a unified interface and the same experience for working with strongly typed tabular data in different languages.
Table of Contents
- Core Concepts
- Getting Started
- DataFrame API
Core Concepts
This section describes the core concepts of DataFrames. The specification defines two components which are referred to as being a DataFrame. On the one hand, a DataFrame denotes a data structure used by programs to handle data in memory in a specific way. On the other hand, a DataFrame is also specified as a file format, recognizable by the '.df' file extension. The following paragraphs will explain in more detail how the specification defines DataFrames.
What are DataFrames?
First and foremost, a DataFrame is an in-memory data structure. Originally popularized by the programming language R where DataFrames are built-in, other languages have in the meantime provided support for DataFrames as well in the form of external libraries. A DataFrame can be thought of as a table of data, similar to a spreadsheet. This way of organizing data is very useful in computer science and software engineering since it makes it easy for humans to understand the content as two-dimensional data. One might think of the rows as individual data points, i.e. observations within the underlying dataset, and the columns as the individual variables of each observation. The columns are also often referred to as features. One might therefore think of a DataFrame as a collection of feature vectors since the rows correspond to the entries in each feature vector at a specific index. When the dataset is cleanly structured, each feature has a specific predetermined type. Therefore, the feature vectors, i.e. columns, also have a specific type because each vector holds data of one and only one specific type.
In essence, a DataFrame is simply a collection of columns (feature vectors). As a data structure it provides methods to manage those columns and both query and manipulate individual data entries. In the object-oriented way of programming, a DataFrame is an object that binds strongly typed columns into one data structure and provides methods to work with that data during program execution.
In addition to the in-memory usage, a DataFrame can be persisted to the file system. The structure of the files read and written is strictly defined. All information in a DataFrame object when used in-memory is serializable and can therefore be written to the filesystem. Any process that knows how to read a DataFrame file can do so and thus load the persisted DataFrame object back into memory whenever desired.
Types
Depending on the underlying programming language, a DataFrame object is represented by a DataFrame interface or class. DataFrames can therefore be referenced in code by variables that have the type DataFrame. In dynamically typed languages (e.g. Python) the variable type is omitted. Every column inside a DataFrame is represented by an object of type Column. It is an abstract class defining methods that all concrete DataFrame columns must implement. The actual column data (internally used array) is handled by a specific implementation of the Column abstract class. The specification demands that a DataFrame can work with 10 different element types. This is manifested by the presence of a separate Column implementation for each element type.
The following table lists all supported types:
Type Name | Element Type | Description | Implementations |
---|---|---|---|
byte | int8 | signed 8-bit integer | ByteColumn, NullableByteColumn |
short | int16 | signed 16-bit integer | ShortColumn, NullableShortColumn |
int | int32 | signed 32-bit integer | IntColumn, NullableIntColumn |
long | int64 | signed 64-bit integer | LongColumn, NullableLongColumn |
float | float32 | single precision 32-bit float | FloatColumn, NullableFloatColumn |
double | float64 | double precision 64-bit float | DoubleColumn, NullableDoubleColumn |
string | string | arbitrary-length unicode string | StringColumn, NullableStringColumn |
char | uint8 | single printable ASCII-character | CharColumn, NullableCharColumn |
boolean | bool | single boolean value | BooleanColumn, NullableBooleanColumn |
binary | uint8 array | arbitrary-length byte array | BinaryColumn, NullableBinaryColumn |
In the table above, the type name denotes the standardized name of the corresponding type. A Column implementation must hold data of the corresponding element type. For example, a ByteColumn holds signed 8-bit integers and can therefore hold integer numbers in the range [-128, 127]. Every official DataFrame implementation must support the described types and provide an implementation for all corresponding Column classes.
Implementations
The specification describes two main DataFrame implementations. These implementations only differ in their treatment of null values. Both implementations provide the same API and overall behaviour.
A DefaultDataFrame is the implementation used by default. It works with primitives which means that it does not support null values. Passing a null value to a DefaultDataFrame at any time will cause a runtime exception.
A NullableDataFrame is a more flexible implementation which can work with null values. Since many programming languages differentiate between primitive data types and objects, this implementation has to use wrapper objects for all primitives as the underlying structure of its columns to allow the use of null values. Generally, as a result, NullableDataFrames are less efficient than DefaultDataFrames. They usually require more memory and some operations are slower. When the usage of null values is not needed, you should always use a DefaultDataFrame.
The concrete Column types to use depend on the DataFrame implementation. For DefaultDataFrames, the concrete Column class is denoted by the type name of the column followed by the 'Column' postfix. For NullableDataFrames, the concrete Column class additionally carries the 'Nullable' prefix. (See the table in Sec. 1.2 Types).
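To make this more concrete, the following minimal Python sketch shows the naming convention and the differing null handling of both implementations. It uses the setter methods which are described in detail in Sec. 3.2 Value Access.

from raven.struct.dataframe import (DefaultDataFrame, NullableDataFrame,
                                    IntColumn, NullableIntColumn)

# a DefaultDataFrame uses the plain Column classes
df1 = DefaultDataFrame(IntColumn("A", [1, 2, 3]))
# a NullableDataFrame uses the 'Nullable'-prefixed Column classes
df2 = NullableDataFrame(NullableIntColumn("A", [1, 2, 3]))

df2.set_int("A", 1, None)    # allowed: NullableDataFrames support null values
# df1.set_int("A", 1, None)  # would cause a runtime exception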
Limitations
The DataFrame specification is designed to represent a general-purpose data structure, file format and data interchange format. The usage is not limited to pure data analysis tasks. However, the specification was not designed to support extremely large datasets (so-called "big data"). There are several reasons for this. First, the binary file format does not support random access to individual elements, which means that a DataFrame file must always be read and written in its entirety. A DataFrame object always resides in memory as a whole. Currently there is no mechanism for loading data "on demand", so DataFrames which do not fit into memory cannot be processed by the underlying system.
DataFrames are not intended to be used as databases. The limitations of the above paragraph apply. One might be tempted to view DataFrames as equivalent to tables of a relational database management system (RDBMS). However, the DataFrame file format was not designed to be used as a database. Although it might make sense to use DataFrames as relational data storage in a prototype or experimental environment, it is recommended to use a standard RDBMS for data storage in production systems. On the other hand, if random access is not required by the underlying use case, using DataFrame files is much simpler and more maintainable.
The DataFrame file format cannot represent DataFrames with more than 2^32 columns and 2^32 rows. Consider using another file format (e.g. HDF5) for larger datasets. The concrete programming language might further limit the size of usable DataFrames. For example, in Java, the maximum length of a Column is 2^31 - 1 because larger arrays cannot be directly allocated.
Based on the exact use case and requirements, you can split up a dataset into multiple DataFrames and index the individual files as desired.
Getting Started
This section describes how to add DataFrames to your project and how to import the API classes.
Adding DataFrames to your Project
Adding DataFrames to a project is easy as precompiled packages are available for common dependency management systems.
For Java projects, the easiest way to add DataFrames is through the Claymore library, which is available on the Maven Central Repository. (source code)
For Python projects, our official implementation is available on PyPI. (source code)
Below you can find commands and dependency entries for your language:
pip install raven-pydf
<!-- Note: Replace major.minor.patch with concrete version numbers! -->

<!-- Maven -->
<dependency>
    <groupId>com.raven-computing</groupId>
    <artifactId>claymore</artifactId>
    <version>major.minor.patch</version>
</dependency>

<!-- Gradle -->
implementation 'com.raven-computing:claymore:major.minor.patch'
You can also add the corresponding library manually. Please see the Development section in the source code repository for the corresponding language.
Importing the Classes in your Code
The core usage during development is provided through the DataFrame interface/class. However, when directly referring to concrete implementation classes or concrete Columns, additional classes might have to be imported.
For the most basic example, let's see how to import DataFrames and construct a new DefaultDataFrame instance in code:
from raven.struct.dataframe import DataFrame # create a DefaultDataFrame with 3 columns and 3 rows df = DataFrame.Default( DataFrame.IntColumn("A", [1, 2, 3]), DataFrame.FloatColumn("B", [4.4, 5.5, 6.6]), DataFrame.StringColumn("C", ["cat", "dog", "horse"]))
import com.raven.common.struct.DataFrame; import com.raven.common.struct.DefaultDataFrame; import com.raven.common.struct.Column; // create a DefaultDataFrame with 3 columns and 3 rows DataFrame df = new DefaultDataFrame( Column.create("A", 1, 2, 3), Column.create("B", 4.4f, 5.5f, 6.6f), Column.create("C", "cat", "dog", "horse"));
The import statements differ slightly depending on the language. The concrete way in which DataFrames and Columns are constructed is one of the few things which are not strictly defined by the specification. The API in each language might therefore vary slightly.
Each concrete DataFrame implementation and concrete Column must be implemented by a separate class. In order to reduce the number of import statements and therefore to make the manual construction of DataFrames more convenient, the core APIs usually provide convenience functions either through the Column API or through the DataFrame interface/class directly.
In the above example, we used the minimum number of import statements. Alternatively, one can construct concrete DataFrame and Column implementations by calling the corresponding constructors directly. The following example creates the same DataFrame as before:
from raven.struct.dataframe import (DefaultDataFrame, IntColumn, FloatColumn, StringColumn) # create a DefaultDataFrame with 3 columns and 3 rows df = DefaultDataFrame( IntColumn("A", [1, 2, 3]), FloatColumn("B", [4.4, 5.5, 6.6]), StringColumn("C", ["cat", "dog", "horse"]))
import com.raven.common.struct.DataFrame; import com.raven.common.struct.DefaultDataFrame; import com.raven.common.struct.IntColumn; import com.raven.common.struct.FloatColumn; import com.raven.common.struct.StringColumn; // create a DefaultDataFrame with 3 columns and 3 rows DataFrame df = new DefaultDataFrame( new IntColumn("A", new int[]{1, 2, 3}), new FloatColumn("B", new float[]{4.4f, 5.5f, 6.6f}), new StringColumn("C", new String[]{"cat", "dog", "horse"}));
Ultimately, how one decides to construct DataFrames and Columns is a matter of taste. Some might prefer to be explicit while others might want shorter code. Both approaches are fine.
Note:
In all subsequent code examples, the import statements will not be explicitly mentioned again for the sake of brevity.
DataFrame API
This section describes the DataFrame API. Most of the calls and operations are standardized through the DataFrame specification. The aim is to provide a unified API for working with DataFrames as a data structure. However, since different programming languages have different features and peculiarities, there are some minor variations of the API, for example when manually constructing DataFrame instances.
This section gives descriptions of all API calls and usage examples through code samples in all supported languages. Expected output in all samples is indicated as commented text. In principle, you can directly copy the code samples and run them for example in an interactive Python REPL, provided that you have imported the necessary classes and created the corresponding DataFrame instance that the specific sample uses to illustrate an API call.
Construction
DataFrames can be created in various ways. As DataFrames are normal objects in the object-oriented sense, they can be created by means of a standard constructor. This also applies to Column objects. In the simplest case, you can construct an empty DataFrame by using the default constructor without specifying any arguments.
For example:
df1 = DataFrame.Default() df2 = DataFrame.Nullable() # or alternatively, when the necessary import statements are present: df1 = DefaultDataFrame() df2 = NullableDataFrame()
DataFrame df1 = new DefaultDataFrame(); DataFrame df2 = new NullableDataFrame();
The above example constructs two DataFrames, a DefaultDataFrame (non-nullable) and a NullableDataFrame. Since we have not defined and added any Columns yet, both DataFrames are completely empty. They are said to be uninitialized.
You can now add as many Columns as you want. However, you may also wish to specify all Columns inside a DataFrame at construction. Therefore, you can pass all Columns you want a DataFrame to hold directly to the constructor.
The following example demonstrates how to construct one labeled Column for each type, both for a DefaultDataFrame and a NullableDataFrame:
df1 = DataFrame.Default( DataFrame.ByteColumn("A", [10, 11, 12]), DataFrame.ShortColumn("B", [13, 14, 15]), DataFrame.IntColumn("C", [16, 17, 18]), DataFrame.LongColumn("D", [19, 20, 21]), DataFrame.FloatColumn("E", [22.1, 23.2, 24.3]), DataFrame.DoubleColumn("F", [25.4, 26.5, 27.6]), DataFrame.StringColumn("G", ["car", "airplane", "bike"]), DataFrame.CharColumn("H", ["a", "b", "c"]), DataFrame.BooleanColumn("I", [True, True, False]), DataFrame.BinaryColumn("J", [bytearray.fromhex("00aa"), bytearray.fromhex("0102bb"), bytearray.fromhex("030405cc")])) df2 = DataFrame.Nullable( DataFrame.NullableByteColumn("A", [10, None, 12]), DataFrame.NullableShortColumn("B", [13, None, 15]), DataFrame.NullableIntColumn("C", [16, None, 18]), DataFrame.NullableLongColumn("D", [19, None, 21]), DataFrame.NullableFloatColumn("E", [22.1, None, 24.3]), DataFrame.NullableDoubleColumn("F", [25.4, None, 27.6]), DataFrame.NullableStringColumn("G", ["car", None, "bike"]), DataFrame.NullableCharColumn("H", ["a", None, "c"]), DataFrame.NullableBooleanColumn("I", [True, None, False]), DataFrame.NullableBinaryColumn("J", [bytearray.fromhex("00aa"), None, bytearray.fromhex("030405cc")])) # or alternatively, when the necessary import statements are present: df1 = DefaultDataFrame( ByteColumn("A", [10, 11, 12]), ShortColumn("B", [13, 14, 15]), IntColumn("C", [16, 17, 18]), LongColumn("D", [19, 20, 21]), FloatColumn("E", [22.1, 23.2, 24.3]), DoubleColumn("F", [25.4, 26.5, 27.6]), StringColumn("G", ["car", "airplane", "bike"]), CharColumn("H", ["a", "b", "c"]), BooleanColumn("I", [True, True, False]), BinaryColumn("J", [bytearray.fromhex("00aa"), bytearray.fromhex("0102bb"), bytearray.fromhex("030405cc")])) df2 = NullableDataFrame( NullableByteColumn("A", [10, None, 12]), NullableShortColumn("B", [13, None, 15]), NullableIntColumn("C", [16, None, 18]), NullableLongColumn("D", [19, None, 21]), NullableFloatColumn("E", [22.1, None, 24.3]), NullableDoubleColumn("F", [25.4, None, 27.6]), NullableStringColumn("G", ["car", None, "bike"]), NullableCharColumn("H", ["a", None, "c"]), NullableBooleanColumn("I", [True, None, False]), NullableBinaryColumn("J", [bytearray.fromhex("00aa"), None, bytearray.fromhex("030405cc")]))
DataFrame df1 = new DefaultDataFrame( Column.create("A", (byte)10, (byte)11, (byte)12), Column.create("B", (short)13, (short)14, (short)15), Column.create("C", 16, 17, 18), Column.create("D", 19L, 20L, 21L), Column.create("E", 22.1f, 23.2f, 24.3f), Column.create("F", 25.4, 26.5, 27.6), Column.create("G", "car", "airplane", "bike"), Column.create("H", 'a', 'b', 'c'), Column.create("I", true, true, false), Column.create("J", new byte[]{0x00, (byte)0xaa}, new byte[]{0x01, 0x02, (byte)0xbb}, new byte[]{0x03, 0x04, 0x05, (byte)0xcc})); DataFrame df2 = new NullableDataFrame( Column.nullable("A", (byte)10, null, (byte)12), Column.nullable("B", (short)13, null, (short)15), Column.nullable("C", 16, null, 18), Column.nullable("D", 19L, null, 21L), Column.nullable("E", 22.1f, null, 24.3f), Column.nullable("F", 25.4, null, 27.6), Column.nullable("G", "car", null, "bike"), Column.nullable("H", 'a', null, 'c'), Column.nullable("I", true, null, false), Column.nullable("J", new byte[]{0x00, (byte)0xaa}, null, new byte[]{0x03, 0x04, 0x05, (byte)0xcc})); // or alternatively, when the necessary import statements are present: DataFrame df1 = new DefaultDataFrame( new ByteColumn("A", new byte[]{(byte)10, (byte)11, (byte)12}), new ShortColumn("B", new short[]{(short)13, (short)14, (short)15}), new IntColumn("C", new int[]{16, 17, 18}), new LongColumn("D", new long[]{19L, 20L, 21L}), new FloatColumn("E", new float[]{22.1f, 23.2f, 24.3f}), new DoubleColumn("F", new double[]{25.4, 26.5, 27.6}), new StringColumn("G", new String[]{"car", "airplane", "bike"}), new CharColumn("H", new char[]{'a', 'b', 'c'}), new BooleanColumn("I", new boolean[]{true, true, false}), new BinaryColumn("J", new byte[][]{new byte[]{0x00, (byte)0xaa}, new byte[]{0x01, 0x02, (byte)0xbb}, new byte[]{0x03, 0x04, 0x05, (byte)0xcc}})); DataFrame df2 = new NullableDataFrame( new NullableByteColumn("A", new Byte[]{(byte)10, null, (byte)12}), new NullableShortColumn("B", new Short[]{(short)13, null, (short)15}), new NullableIntColumn("C", new Integer[]{16, null, 18}), new NullableLongColumn("D", new Long[]{19L, null, 21L}), new NullableFloatColumn("E", new Float[]{22.1f, null, 24.3f}), new NullableDoubleColumn("F", new Double[]{25.4, null, 27.6}), new NullableStringColumn("G", new String[]{"car", null, "bike"}), new NullableCharColumn("H", new Character[]{'a', null, 'c'}), new NullableBooleanColumn("I", new Boolean[]{true, null, false}), new NullableBinaryColumn("J", new byte[][]{new byte[]{0x00, (byte)0xaa}, null, new byte[]{0x03, 0x04, 0x05, (byte)0xcc}}));
Please note that you can optionally also pass default Column instances to the NullableDataFrame constructor. However, since the Columns are internally converted to corresponding nullable Column instances, this is less efficient because the conversion entails copying the Column values. Be aware that the opposite is not possible, i.e. the DefaultDataFrame constructor must be given non-nullable (default) Column instances.
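As a minimal Python sketch of the conversion described above (the Java constructor behaves analogously):

# default (non-nullable) Columns may be passed to a NullableDataFrame;
# they are converted to nullable Columns internally, which copies their values
df = NullableDataFrame(
    IntColumn("A", [1, 2, 3]),                   # converted to a NullableIntColumn
    StringColumn("B", ["cat", "dog", "horse"]))  # converted to a NullableStringColumn

# the reverse is not possible (see the note above):
# DefaultDataFrame(NullableIntColumn("A", [1, 2, 3]))  # would cause a runtime exception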
In the above example, all Columns were also labeled directly during their construction. Although it is recommended for most use cases to use labeled Columns, it is by no means a necessity. That is, you can also construct all Column instances without providing a name.
For example, the following code shows how to construct a DefaultDataFrame with unlabeled Columns:
# the following columns are not labeled. # you have to omit the column names (leave them as None) df = DefaultDataFrame( StringColumn(values=["good", "medium", "bad"]), IntColumn(values=[10, 5, 0]))
// the following columns are not labeled. // you have to use the constructors of the columns directly DataFrame df = new DefaultDataFrame( new StringColumn(new String[]{"good", "medium", "bad"}), new IntColumn(new int[]{10, 5, 0}));
You can set column names at any time by calling the appropriate methods, see Sec. 3.5 Column Names.
Value Access
All values inside a DataFrame can be accessed and set individually. This is done by using ordinary getters and setters. Since DataFrames use typed columns, methods for reading and writing values contain the type name in their signature.
Both read and write operations on individual values are guaranteed to be performed in constant time O(1).
Getters
Use the appropriate get method to retrieve an individual value inside a DataFrame. Two positions have to be specified when calling a get method: the column index and row index. The column index can be replaced by the column name if such a column exists inside the DataFrame.
The following example demonstrates how to access individual values for every supported type. The underlying DataFrame comes from the second example in Sec. 3.1 Construction.
mybyte = df1.get_byte("A", 1) # returns a Python int myshort = df1.get_short("B", 1) # returns a Python int myint = df1.get_int("C", 1) # returns a Python int mylong = df1.get_long("D", 1) # returns a Python int myfloat = df1.get_float("E", 1) # returns a Python float mydouble = df1.get_double("F", 1) # returns a Python float mystring = df1.get_string("G", 1) # returns a Python str mychar = df1.get_char("H", 1) # returns a Python str of length 1 mybool = df1.get_boolean("I", 1) # returns a Python bool mybytearray = df1.get_binary("J", 1) # returns a Python bytearray
// variable types can be primitives when working with DefaultDataFrames // since all values are guaranteed to be non-null byte mybyte = df1.getByte("A", 1); short myshort = df1.getShort("B", 1); int myint = df1.getInt("C", 1); long mylong = df1.getLong("D", 1); float myfloat = df1.getFloat("E", 1); double mydouble = df1.getDouble("F", 1); String mystring = df1.getString("G", 1); char mychar = df1.getChar("H", 1); boolean mybool = df1.getBoolean("I", 1); byte[] mybytearray = df1.getBinary("J", 1);
Equivalently, the columns A-J can be referenced with their corresponding column index 0-9 when getting a value. When using DefaultDataFrames, all values returned by a get method are guaranteed to be non-null. However, when using NullableDataFrames, values returned by a get method might be null (or the equivalent for the underlying language).
For example, when calling the get methods on df2 instead of df1 then the returned values will be null because all entries in the row at index 1 were set to null in the second example in Sec. 3.1 Construction.
myfloat = df2.get_float("E", 1) # returns None
// variable types should be primitive wrapper objects when working // with NullableDataFrames since the value returned by get methods // might be null Float myfloat = df2.getFloat("E", 1); // returns null
Setters
Analogous to get methods, you can use set methods to write individual values inside a DataFrame. Since columns are strongly typed, all DataFrames will enforce the correct type for each element in all operations, even in dynamically typed languages like Python.
For example, the following code will set a new value for each column in the DataFrame from the second example in Sec. 3.1 Construction at index 1:
df1.set_byte("A", 1, 42) # must be a Python int df1.set_short("B", 1, 42) # must be a Python int df1.set_int("C", 1, 42) # must be a Python int df1.set_long("D", 1, 42) # must be a Python int df1.set_float("E", 1, 42.123) # must be a Python float df1.set_double("F", 1, 42.123) # must be a Python float df1.set_string("G", 1, "Hello") # must be a Python str df1.set_char("H", 1, "x") # must be a Python str of length 1 df1.set_boolean("I", 1, False) # must be a Python bool df1.set_binary("J", 1, bytearray.fromhex("aabbccff")) # must be a Python bytearray
df1.setByte("A", 1, (byte)42); df1.setShort("B", 1, (short)42); df1.setInt("C", 1, 42); df1.setLong("D", 1, 42L); df1.setFloat("E", 1, 42.123f); df1.setDouble("F", 1, 42.123); df1.setString("G", 1, "Hello"); df1.setChar("H", 1, 'x'); df1.setBoolean("I", 1, false); df1.setBinary("J", 1, new byte[]{(byte)0xaa, (byte)0xbb, (byte)0xcc, (byte)0xff});
Equivalently, the columns A-J can be referenced with their corresponding column index 0-9 when setting a value. When using DefaultDataFrames, the specified value must not be null. When using NullableDataFrames, the specified value may be null.
For example, when calling the set methods on df2 instead of df1 then the specified values may be null.
# set the value in the 'E' column at row index 2 to None df2.set_float("E", 2, None)
// set the value in the 'E' column at row index 2 to null df2.setFloat("E", 2, null);
Metrics
Because DataFrames are complex objects consisting of one or more columns, they exhibit certain properties. These properties can be queried at any time by calling the corresponding method.
Columns and Rows
The current number of columns and rows can be queried by simple method calls.
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad False # 4| 55 aae True print(df.columns()) # 3 print(df.rows()) # 5
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad false // 4| 55 aae true System.out.println(df.columns()); // 3 System.out.println(df.rows()); // 5
Capacity
The capacity of a DataFrame is the number of rows it can hold without the necessity of a resizing operation. This is done so that adding, inserting and removing rows is more efficient because copy operations do not have to be performed in every method call. The capacity therefore is the actual length of each internal array used by every Column within a particular DataFrame.
The following example illustrates how the capacity behaves when more rows are added to a DataFrame:
df = DefaultDataFrame( IntColumn("A", [11, 22, 33]), StringColumn("B", ["aaa", "aab", "aac"]), BooleanColumn("C", [True, True, False])) print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False print(df.capacity()) # 3 df.add_row([44, "aad", True]) print(df.capacity()) # 6 df.add_row([55, "aae", False]) print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad True # 4| 55 aae False print(df.capacity()) # 6
DataFrame df = new DefaultDataFrame( Column.create("A", 11, 22, 33), Column.create("B", "aaa", "aab", "aac"), Column.create("C", true, true, false)); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false System.out.println(df.capacity()); // 3 df.addRow(44, "aad", true); System.out.println(df.capacity()); // 6 df.addRow(55, "aae", false); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad true // 4| 55 aae false System.out.println(df.capacity()); // 6
In the above example a DataFrame with 3 columns and 3 rows is constructed. Since the arrays of each Column are specified at construction, the capacity of the created DataFrame is equal to the number of rows, i.e. there is no additional buffer present. When an additional row is added to the DataFrame, the capacity must be increased so that the row fits into the DataFrame. The concrete resizing strategy is an implementation detail of a DataFrame. In this example the capacity is doubled. When another row is added, the capacity is not increased again because the underlying buffered space is large enough to hold the provided row data. Therefore, after both rows have been added, the length of the internal arrays in each Column is actually 6 even though the DataFrame has 5 rows.
The capacity of a DataFrame can be controlled via the flush() method (see Sec. 3.17.8 Flush).
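For example, assuming flush() trims the unused capacity buffer as described in Sec. 3.17.8, it could be used like this, continuing from the example above:

print(df.rows())      # 5
print(df.capacity())  # 6
# remove the unused buffer so that the capacity equals the number of rows
df.flush()
print(df.capacity())  # 5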
Column Operations
Columns can be added, set, inserted and removed at any time.
Add
New Column objects can be added to a DataFrame, which will place the specified Column at the right end and assign it the corresponding column index. When working with DefaultDataFrames, the Column length must match the length of the already existing Columns. When working with NullableDataFrames, all Columns are resized if the provided Column has a different length, and missing values are set to null. When an empty Column is provided, all column entries are set to either default values or null values, depending on the Column type (see the sketch after the following example).
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.add_column(FloatColumn("D", [1.0, 2.0, 3.0])) print(df) # _| A B C D # 0| 11 aaa True 1.0 # 1| 22 aab True 2.0 # 2| 33 aac False 3.0
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false df.addColumn(new FloatColumn("D", new float[]{1.0f, 2.0f, 3.0f})); System.out.println(df); // _| A B C D // 0| 11 aaa true 1.0 // 1| 22 aab true 2.0 // 2| 33 aac false 3.0
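As a sketch of the empty-Column case mentioned above, continuing in Python from the previous example (a DefaultDataFrame, so the missing entries are filled with default values):

# add an empty column: all entries are set to the default value of its type
df.add_column(DoubleColumn("E"))
print(df)
# _| A  B   C     D   E
# 0| 11 aaa True  1.0 0.0
# 1| 22 aab True  2.0 0.0
# 2| 33 aac False 3.0 0.0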
The name by which a Column should be referenceable within a DataFrame can be explicitly set when adding a Column. Be aware that this will also override the name within the specified Column.
The following example illustrates this:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col = FloatColumn("D", [1.0, 2.0, 3.0]) print(col.get_name()) # D df.add_column(col, name="F") print(df) # _| A B C F # 0| 11 aaa True 1.0 # 1| 22 aab True 2.0 # 2| 33 aac False 3.0 print(col.get_name()) # F
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col = new FloatColumn("D", new float[]{1.0f, 2.0f, 3.0f}); System.out.println(col.getName()); // D df.addColumn("F", col); System.out.println(df); // _| A B C F // 0| 11 aaa true 1.0 // 1| 22 aab true 2.0 // 2| 33 aac false 3.0 System.out.println(col.getName()); // F
Insert
Column objects can be inserted at a specific position (column index) within a DataFrame. The Column at the specified index and all Columns to the right of that position are shifted to the right. Therefore, all Columns to the right of the specified index will be referenceable by their original column index incremented by 1. The Column names are not affected by this operation.
The following example shows the insertion of a Column:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # insert a new column df.insert_column(1, FloatColumn("D")) print(df) # _| A D B C # 0| 11 0.0 aaa True # 1| 22 0.0 aab True # 2| 33 0.0 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // insert a new column df.insertColumn(1, new FloatColumn("D")); System.out.println(df); // _| A D B C // 0| 11 0.0 aaa true // 1| 22 0.0 aab true // 2| 33 0.0 aac false
In the above example the added FloatColumn is originally empty. Because the DataFrame already has 3 rows the inserted Column is resized and the missing values are replaced with default values of the FloatColumn.
The name of the inserted Column can be explicitly set when inserting:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col = FloatColumn("D") print(col.get_name()) # D df.insert_column(1, col, name="F") print(df) # _| A F B C # 0| 11 0.0 aaa True # 1| 22 0.0 aab True # 2| 33 0.0 aac False print(col.get_name()) # F
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col = new FloatColumn("D"); System.out.println(col.getName()); // D df.insertColumn(1, "F", col); System.out.println(df); // _| A F B C // 0| 11 0.0 aaa true // 1| 22 0.0 aab true // 2| 33 0.0 aac false System.out.println(col.getName()); // F
Remove
Column instances can be removed from a DataFrame in three ways: by column index, by column name and by object reference.
The corresponding method returns the removed Column instance when the Column argument is specified as an index or name:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col = df.remove_column("B") # returns the removed column # or equivalently: # col = df.remove_column(1) print(df) # _| A C # 0| 11 True # 1| 22 True # 2| 33 False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col = df.removeColumn("B"); // returns the removed column // or equivalently: // Column col = df.removeColumn(1); System.out.println(df); // _| A C // 0| 11 true // 1| 22 true // 2| 33 false
When the argument is specified as a Column instance, the corresponding method returns a boolean value which indicates whether the specified Column was successfully removed. If the specified Column is not part of the underlying DataFrame, then the method call has no effect and a boolean value of false is returned.
For example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col1 = df.get_column("B") col2 = IntColumn("F") val = df.remove_column(col1) # returns a bool print(val) # True val = df.remove_column(col2) print(val) # False print(df) # _| A C # 0| 11 True # 1| 22 True # 2| 33 False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col1 = df.getColumn("B"); Column col2 = new IntColumn("F"); boolean val = df.removeColumn(col1); System.out.println(val); // true val = df.removeColumn(col2); System.out.println(val); // false System.out.println(df); // _| A C // 0| 11 true // 1| 22 true // 2| 33 false
Set Columns
Columns can be explicitly set. If the specified Column is not already part of the underlying DataFrame, then the behaviour is equivalent to adding the Column to the DataFrame. On the other hand, if the specified Column is already present within the underlying DataFrame, then the present Column will be replaced by the specified instance.
For example, a particular column inside a DataFrame can be replaced by another Column instance:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.set_column("B", FloatColumn()) print(df) # _| A B C # 0| 11 0.0 True # 1| 22 0.0 True # 2| 33 0.0 False df.set_column(2, IntColumn()) print(df) # _| A B C # 0| 11 0.0 0 # 1| 22 0.0 0 # 2| 33 0.0 0
System.out.println(df); // _| A B C // 0| 11 aaa True // 1| 22 aab True // 2| 33 aac False df.setColumn("B", new FloatColumn()); System.out.println(df); // _| A B C // 0| 11 0.0 True // 1| 22 0.0 True // 2| 33 0.0 False df.setColumn(2, new IntColumn()); System.out.println(df); // _| A B C // 0| 11 0.0 0 // 1| 22 0.0 0 // 2| 33 0.0 0
The above example shows that the type of the Column specified as the method argument does not necessarily have to be equal to the type of the Column inside of the DataFrame.
If the specified column name is not present in the underlying DataFrame, then the argument Column is effectively added to the DataFrame:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.set_column("D", FloatColumn()) print(df) # _| A B C D # 0| 11 aaa True 0.0 # 1| 22 aab True 0.0 # 2| 33 aac False 0.0
System.out.println(df); // _| A B C // 0| 11 aaa True // 1| 22 aab True // 2| 33 aac False df.setColumn("D", new FloatColumn()); System.out.println(df); // _| A B C D // 0| 11 aaa True 0.0 // 1| 22 aab True 0.0 // 2| 33 aac False 0.0
Get Columns
The Column objects themselves can be referenced both individually and as a group. Accessing one individual Column will simply provide a reference to a particular Column instance.
For example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col = df.get_column("B") print(col) # <raven.struct.dataframe.stringcolumn.StringColumn object at 0x7f9334c2f7f0> print(col.get_value(1)) # aab col = df.get_column(2) print(col) # <raven.struct.dataframe.booleancolumn.BooleanColumn object at 0x7f9334d231f0> print(col.get_value(1)) # True
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col = df.getColumn("B"); System.out.println(col); // com.raven.common.struct.StringColumn@5b3e8e3 System.out.println(col.getValue(1)); // aab col = df.getColumn(2); System.out.println(col); // com.raven.common.struct.BooleanColumn@131b97 System.out.println(col.getValue(1)); // true
Alternatively, the DataFrame API provides a method to get multiple Columns bound together in a new DataFrame instance of the same type. The Columns can be selected by index, name or type.
The following example illustrates this:
print(df) # _| A B C D E F # 0| 11 aaa True 44 11.0 bba # 1| 22 aab True 55 12.0 bbb # 2| 33 aac False 66 13.0 bbc # get columns by names df2 = df.get_columns(cols=("B", "F", "A")) print(df2) # _| B F A # 0| aaa bba 11 # 1| aab bbb 22 # 2| aac bbc 33 # get columns by indices df2 = df.get_columns(cols=(1, 2, 4)) print(df2) # _| B C E # 0| aaa True 11.0 # 1| aab True 12.0 # 2| aac False 13.0 # get columns by types df2 = df.get_columns(types=("int", "float")) print(df2) # _| A D E # 0| 11 44 11.0 # 1| 22 55 12.0 # 2| 33 66 13.0
System.out.println(df); // _| A B C D E F // 0| 11 aaa true 44 11.0 bba // 1| 22 aab true 55 12.0 bbb // 2| 33 aac false 66 13.0 bbc // get columns by names DataFrame df2 = df.getColumns("B", "F", "A"); System.out.println(df2); // _| B F A // 0| aaa bba 11 // 1| aab bbb 22 // 2| aac bbc 33 // get columns by indices df2 = df.getColumns(1, 2, 4); System.out.println(df2); // _| B C E // 0| aaa true 11.0 // 1| aab true 12.0 // 2| aac false 13.0 // get columns by types df2 = df.getColumns(Integer.class, Float.class); System.out.println(df2); // _| A D E // 0| 11 44 11.0 // 1| 22 55 12.0 // 2| 33 66 13.0
The order of the arguments defines the order of the Columns in the returned DataFrame. One important thing to note is that all Columns are added to the returned DataFrame by reference, i.e. selecting Columns does not copy the underlying data. If you want a truly independent DataFrame as a result, you must explicitly copy the returned DataFrame.
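A short sketch of these reference semantics, continuing from the example above:

# the selected Columns are shared by reference, so modifying the
# selection also modifies the original DataFrame
df2 = df.get_columns(cols=("A", "D"))
df2.set_int("A", 0, 999)
print(df.get_int("A", 0))  # 999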
Direct Access
Columns inside a DataFrame can be accessed as shown in the previous sections. As a Column is simply a container for data, it provides methods for reading and writing individual values directly. This can become useful when writing highly optimized code because certain access checks performed by DataFrames are then omitted.
Warning:
Directly accessing and manipulating values inside a Column instance is a more low-level operation. Column instances are allowed to throw exceptions other than DataFrameException!
The following example shows how to get a reference to a Column instance and set a specific value directly:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False col = df.get_column("B") col.set_value(1, "New Value") print(df) # _| A B C # 0| 11 aaa True # 1| 22 New Value True # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Column col = df.getColumn("B"); col.setValue(1, "New Value"); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 New Value true // 2| 33 aac false
You can even get a reference to the internal array of the underlying Column:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # get the numpy array of column A (may include a capacity buffer) array = df.get_column("A").as_array() print(array) # [11 22 33] array[1] = 42 print(df) # _| A B C # 0| 11 aaa True # 1| 42 aab True # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // get the internal array object (may include a capacity buffer) Column col = df.getColumn("A"); int[] array = ((IntColumn)col).asArray(); System.out.println(Arrays.toString(array)); // [11, 22, 33] array[1] = 42; System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 42 aab true // 2| 33 aac false
Warning:
No access checks by the underlying DataFrame are performed when handling internal arrays directly. Misuse of direct array access might lead to an invalid DataFrame state!
Column Iteration
DataFrames can be iterated over in multiple ways. The following example shows how to iterate over all columns within a DataFrame:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # with a for-each loop: for col in df: print(col.get_name()) # A # B # C # with a classic range loop: for i in range(df.columns()): print(df.get_column(i).get_name()) # A # B # C
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // with a for-each loop: for(Column col : df){ System.out.println(col.getName()); } // A // B // C // with a classic c-style loop: for(int i=0; i<df.columns(); ++i){ System.out.println(df.getColumn(i).getName()); } // A // B // C
Column Names
Column names can be queried and set, for individual Columns or the entire DataFrame.
Get, Set and Remove all Column Names
The DataFrame API provides methods to query and manipulate column names for the entire DataFrame at once. All column names are represented as strings. Please note that column names must never be specified as null or empty strings. Therefore, any valid column name consists of at least one character.
The following example shows how to get and set all column names at once:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False names = df.get_column_names() # returns a list print(names) # ['A', 'B', 'C'] df.set_column_names(["X", "Y", "Z"]) print(df) # _| X Y Z # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false String[] names = df.getColumnNames(); System.out.println(Arrays.toString(names)); // [A, B, C] df.setColumnNames("X", "Y", "Z"); System.out.println(df); // _| X Y Z // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false
Since a column name argument cannot be specified as null or empty, the DataFrame API provides methods to remove all column names and to explicitly query whether column names are set.
The following example illustrates this:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False val = df.has_column_names() # returns a bool print(val) # True df.remove_column_names() val = df.has_column_names() print(val) # False print(df) # _| 0 1 2 # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false boolean val = df.hasColumnNames(); System.out.println(val); // true df.removeColumnNames(); val = df.hasColumnNames(); System.out.println(val); // false System.out.println(df); // _| 0 1 2 // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false
The above example shows that columns without a name are represented with their column index when converting a DataFrame to a string representation.
Individual Columns
Column name operations can also be performed for individual Columns. The following example shows how to query and set names of individual Columns:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False name = df.get_column_name(1) # returns a str print(name) # 'B' df.set_column_name(1, "NewName") name = df.get_column_name(1) print(name) # 'NewName' df.set_column_name("C", "OtherName") name = df.get_column_name(2) print(name) # 'OtherName' print(df) # _| A NewName OtherName # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false String name = df.getColumnName(1); System.out.println(name); // B df.setColumnName(1, "NewName"); name = df.getColumnName(1); System.out.println(name); // NewName df.setColumnName("C", "OtherName"); name = df.getColumnName(2); System.out.println(name); // OtherName System.out.println(df); // _| A NewName OtherName // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false
The index at which an individual Column is located inside a DataFrame can also be queried, as shown in the following example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False index = df.get_column_index("A") # returns an int print(index) # 0 index = df.get_column_index("B") print(index) # 1
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false int index = df.getColumnIndex("A"); System.out.println(index); // 0 index = df.getColumnIndex("B"); System.out.println(index); // 1
The existence of a Column with a particular name can be queried with a separate method:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False val = df.has_column("B") # returns a bool print(val) # True val = df.has_column("Data") print(val) # False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false boolean val = df.hasColumn("B"); System.out.println(val); // true val = df.hasColumn("Data"); System.out.println(val); // false
Row Operations
This section describes how to add, insert, set and remove rows inside a DataFrame. All rows can also be queried in various ways. Even when you think of a DataFrame as a collection of rows, it is important to understand that a DataFrame does not store rows in memory as row objects, but rather as column values. Therefore, a row can simply be seen as the array of values inside all Columns at a specific row index. The values in a row can be heterogeneous, i.e. each value is of the corresponding type used by the Column the row item is part of.
Add
Rows can be added to a DataFrame. This might entail a resizing operation of all Columns if the underlying DataFrame has no free capacity to store the provided row items inside the corresponding Columns (see Sec. 3.3.2 Capacity).
The following example shows how to add a single row to a DataFrame:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.add_row([44, "aad", True]) print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad True # this would fail because of a wrong type: # v # df.add_row([55.5, "aae", False]) # this would also fail because the row is too long: # v # df.add_row([55, "aae", False, "anotherItem"])
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false df.addRow(44, "aad", true); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad true // this would fail because of a wrong type: // v // df.addRow(55.5f, "aae", false); // this would also fail because the row is too long: // v // df.addRow(55, "aae", false, "anotherItem");
As seen in the above example, all row item types must exactly match the element type of each corresponding Column. Row item types are not automatically converted. This also means, for example, that you cannot specify a row item as the number 300 if the Column at the index of the row item is a ByteColumn, because a byte has only a valid range of [-128, +127].
Additionally, the length of the specified row must match the number of Columns within the DataFrame such that every row item has a corresponding Column to which it is added.
Alternatively, rows can also be added from another DataFrame instance directly. The following example illustrates this:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False print(df2) # _| A B C # 0| 97 cca True # 1| 98 ccb False # 2| 99 ccc True df.add_rows(df2) print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 97 cca True # 4| 98 ccb False # 5| 99 ccc True
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false System.out.println(df2); // _| A B C // 0| 97 cca true // 1| 98 ccb false // 2| 99 ccc true df.addRows(df2); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 97 cca true // 4| 98 ccb false // 5| 99 ccc true
Insert
Rows can be inserted into a DataFrame at a specific row index. The same restrictions as for row additions apply. The following example shows how to insert rows into a DataFrame:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.insert_row(1, [42, "AAA", False]) print(df) # _| A B C # 0| 11 aaa True # 1| 42 AAA False # 2| 22 aab True # 3| 33 aac False df.insert_row(0, [99, "BBB", False]) print(df) # _| A B C # 0| 99 BBB False # 1| 11 aaa True # 2| 42 AAA False # 3| 22 aab True # 4| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // the first argument specifies the row index df.insertRow(1, 42, "AAA", false); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 42 AAA false // 2| 22 aab true // 3| 33 aac false df.insertRow(0, 99, "BBB", false); System.out.println(df); // _| A B C // 0| 99 BBB false // 1| 11 aaa true // 2| 42 AAA false // 3| 22 aab true // 4| 33 aac false
Remove
Rows can be removed in two ways. You can either specify a range of row indices which removes all rows within that range, or you can specify a column and regular expression which removes all rows that match the regex in the specified Column.
How to remove rows in a given range is shown in the following example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad False # 4| 55 aae True # 5| 66 aaf False # 6| 77 aag True df.remove_rows(from_index=1, to_index=5) print(df) # _| A B C # 0| 11 aaa True # 1| 66 aaf False # 2| 77 aag True
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad false // 4| 55 aae true // 5| 66 aaf false // 6| 77 aag true df.removeRows(1, 5); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 66 aaf false // 2| 77 aag true
How to remove rows that match specific values in a given Column is shown in the following example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad False # 4| 55 aae True # 5| 66 aaf False # 6| 77 aag True df.remove_rows("B", "aa[b-e]") print(df) # _| A B C # 0| 11 aaa True # 1| 66 aaf False # 2| 77 aag True
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad false // 4| 55 aae true // 5| 66 aaf false // 6| 77 aag true df.removeRows("B", "aa[b-e]"); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 66 aaf false // 2| 77 aag true
Set Rows
Rows at specific indices can be set directly. The following example illustrates this:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False df.set_row(1, [42, "AAA", False]) print(df) # _| A B C # 0| 11 aaa True # 1| 42 AAA False # 2| 33 aac False
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false df.setRow(1, 42, "AAA", false); System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 42 AAA false // 2| 33 aac false
Get Rows
You can get rows from a DataFrame in two ways. You can either select a single row by its row index or select multiple rows by a range of row indices.
How to get a row at a specific row index is shown in the following example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False row = df.get_row(1) #returns a list print(row) # [22, "aab", True]
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false Object[] row = df.getRow(1); System.out.println(Arrays.toString(row)); // [22, aab, true]
How to get multiple rows, bound together in a DataFrame with the same column structure, is shown in the following example:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False # 3| 44 aad False # 4| 55 aae True # 5| 66 aaf False # 6| 77 aag True df2 = df.get_rows(2, 5) # returns a DataFrame print(df2) # _| A B C # 0| 33 aac False # 1| 44 aad False # 2| 55 aae True
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false // 3| 44 aad false // 4| 55 aae true // 5| 66 aaf false // 6| 77 aag true DataFrame df2 = df.getRows(2, 5); System.out.println(df2); // _| A B C // 0| 33 aac false // 1| 44 aad false // 2| 55 aae true
Row Iteration
DataFrames can be iterated over in multiple ways. The following example shows how to iterate over all rows within a DataFrame:
print(df) # _| A B C # 0| 11 aaa True # 1| 22 aab True # 2| 33 aac False for i in range(df.rows()): print(df.get_row(i)) # [11, 'aaa', True] # [22, 'aab', True] # [33, 'aac', False]
System.out.println(df); // _| A B C // 0| 11 aaa true // 1| 22 aab true // 2| 33 aac false for(int i=0; i<df.rows(); ++i){ System.out.println(Arrays.toString(df.getRow(i))); } // [11, aaa, true] // [22, aab, true] // [33, aac, false]
Search Operations
Specific elements can be searched for. The elements can be specified as regular expressions to allow easy pattern matching without the need to write complex multi-line conditional code. Searching is always done with respect to one particular Column. You may search either for a single element or for all elements within a specified Column that match a given regular expression. Since the condition to match elements against is specified as a regular expression, the search term is always a string, even when searching for a specific number.
Search for a Single Element
The following example shows how to search for a single element in a specific Column:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B index = df.index_of("name", "Paul") #returns an int print(index) # 4 index = df.index_of("name", "Steven") print(index) # -1 index = df.index_of("age", "2\\d") print(index) # 2
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int index = df.indexOf("name", "Paul"); System.out.println(index); // 4 index = df.indexOf("name", "Steven"); System.out.println(index); // -1 index = df.indexOf("age", "2\\d"); System.out.println(index); // 2
The above example shows that the indexOf() method simply returns the index of the first element that matches the specified search term, or -1 if no element in the specified Column matches the search term.
Optionally, you can specify a row index from which to start searching. This will effectively ignore all column values prior to that index in the search operation. The following example illustrates how to do that:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B index = df.index_of("age", "2\\d", start_from=3) print(index) # 4 index = df.index_of("name", "Bob", start_from=2) print(index) # -1
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int index = df.indexOf("age", 3, "2\\d"); System.out.println(index); // 4 index = df.indexOf("name", 2, "Bob"); System.out.println(index); // -1
Search for Multiple Elements
If you want to get the row indices of all elements that match the given search term, then the DataFrame API provides an alternative method for that purpose.
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B indices = df.index_of_all("age", "2\\d") #returns a list of int print(indices) # [2, 4, 5] indices = df.index_of_all("active", "True") print(indices) # [0, 2, 3, 4] indices = df.index_of_all("group", "F") print(indices) # []
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int[] indices = df.indexOfAll("age", "2\\d"); System.out.println(Arrays.toString(indices)); // [2, 4, 5] indices = df.indexOfAll("active", "true"); System.out.println(Arrays.toString(indices)); // [0, 2, 3, 4] indices = df.indexOfAll("group", "F"); System.out.println(Arrays.toString(indices)); // []
This operation is useful when you don't just care about the elements you are searching for but rather the row indices that they are located at. This can then easily be used to do something with other elements in those rows.
For example, the following code shows how to print the name of all people who are in their twenties:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B for index in df.index_of_all("age", "2\\d"): print(df.get_string("name", index)) # Mark # Paul # Simon
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B for(int index : df.indexOfAll("age", "2\\d")){ System.out.println(df.getString("name", index)); } // Mark // Paul // Simon
Since the indexOfAll() method never returns null but rather an empty array/list if there are no matches, the code from the above example can be safely used even in such a situation.
Filter Operations
The DataFrame API provides various methods to filter the content of any DataFrame. Generally, a filter operation acts on all rows that have a matching element in a specific Column. There are two ways you can treat matched rows: either retain or discard them. Therefore, you can use filter operations to specifically keep certain rows in a DataFrame or on the other hand specifically remove certain rows. There are two separate functions for each filter operation mode: one returns the result of the filter operation as a new DataFrame and leaves the original DataFrame unchanged, and the other directly changes the DataFrame that the filter operation is called upon. Additionally, it is also possible to retain a certain number of first or last rows.
The following table gives a summary of the behaviour of all filter operations:
Function | Operation | Mode |
---|---|---|
filter() | Retains matching rows | Returns a new DataFrame |
drop() | Discards matching rows | Returns a new DataFrame |
include() | Retains matching rows | Directly changes the DataFrame |
exclude() | Discards matching rows | Directly changes the DataFrame |
head() | Retains the first n rows | Returns a new DataFrame |
tail() | Retains the last n rows | Returns a new DataFrame |
Please note that all filter operations return a DataFrame instance, regardless of the operation mode. That is, even the include() and exclude() functions return a DataFrame (the same instance that the function was called upon). In this way, you can arbitrarily combine filter operations simply by chaining function calls to get the desired result.
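For example, a minimal sketch of such a chain in Python, assuming the same example data used in the following subsections:

# include() and exclude() both return the DataFrame instance they were
# called upon, so the calls can be chained
df.include("active", "True").exclude("group", "B")
print(df)
# _| name age active group
# 0| Bill 34 True A
# 1| Mark 25 True C
# 2| Paul 29 True A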
Filter
The filter() function retains all rows that have a matching element in the specified Column. The result of the operation is returned as a new independent DataFrame instance.
The following example shows how to use the filter() function:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B res = df.filter("age", "2\\d") # returns a DataFrame print(res) # _| name age active group # 0| Mark 25 True C # 1| Paul 29 True A # 2| Simon 21 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame res = df.filter("age", "2\\d"); System.out.println(res); // _| name age active group // 0| Mark 25 true C // 1| Paul 29 true A // 2| Simon 21 false B
In the above example, a DataFrame containing attributes of some random people is filtered to retain only those people whose age is in the interval [20, 29]. The result of the computation is returned as a new DataFrame instance. The DataFrame instance that the filter() function was called upon is not changed.
Drop
The drop() function discards all rows that have a matching element in the specified Column. The result of the operation is returned as a new independent DataFrame instance.
The following example shows how to use the drop() function:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B res = df.drop("age", "2\\d") # returns a DataFrame print(res) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Sofia 31 True B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame res = df.drop("age", "2\\d"); System.out.println(res); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Sofia 31 true B
In the above example, a DataFrame containing attributes of some random people is filtered to remove all people whose age is in the interval [20, 29]. The result of the computation is returned as a new DataFrame instance. The DataFrame instance that the drop() function was called upon is not changed.
Include
The include() function retains all rows that have a matching element in the specified Column. The operation is directly performed on the DataFrame instance.
The following example shows how to use the include() function:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B df.include("active", "True") print(df) # _| name age active group # 0| Bill 34 True A # 1| Mark 25 True C # 2| Sofia 31 True B # 3| Paul 29 True A
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B df.include("active", "true"); System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Mark 25 true C // 2| Sofia 31 true B // 3| Paul 29 true A
In the above example, a DataFrame containing attributes of some random people is filtered to only hold people whose active attribute is true. The computation is directly performed on the DataFrame instance. Therefore, the DataFrame instance that the include() function was called upon is changed.
Exclude
The exclude() function removes all rows that have a matching element in the specified Column. The operation is directly performed on the DataFrame instance.
The following example shows how to use the exclude() function:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B df.exclude("active", "True") print(df) # _| name age active group # 0| Bob 36 False B # 1| Simon 21 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B df.exclude("active", "true"); System.out.println(df); // _| name age active group // 0| Bob 36 false B // 1| Simon 21 false B
In the above example, a DataFrame containing attributes of some random people is filtered to discard all people whose active attribute is true. The computation is directly performed on the DataFrame instance. Therefore, the DataFrame instance that the exclude() function was called upon is changed.
Head and Tail
The head() function simply returns the first n rows inside the DataFrame whereas the tail() function returns the last n rows. The result of the computation is returned as a new independent DataFrame instance.
The following example illustrates this:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B first3 = df.head(3) # returns a DataFrame last3 = df.tail(3) # returns a DataFrame print(first3) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C print(last3) # _| name age active group # 0| Sofia 31 True B # 1| Paul 29 True A # 2| Simon 21 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame first3 = df.head(3); DataFrame last3 = df.tail(3); System.out.println(first3); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C System.out.println(last3); // _| name age active group // 0| Sofia 31 true B // 1| Paul 29 true A // 2| Simon 21 false B
Value Replacement
The DataFrame API provides various ways to change and replace column values. Instead of setting specific values inside a column directly, you can use the replace() method to perform a bulk replacement. This involves three parts: the column to replace values in, an optional condition and the replacement. As always, the Column to replace values in can be specified either by index or by name. The condition is specified as a regular expression that all values to be replaced must match. Therefore, any value that does not match the specified regex is not changed inside the specified Column. The replacement can be specified either as a constant or as a function. The replace() method returns an integer which indicates how many values were replaced by the operation, i.e. the number of values that matched the specified condition and were therefore replaced by the specified value.
Conditional Replacement
In the simplest case, you can use the replace() method to change all values that match a given regex to some other value.
For example:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B replaced = df.replace("group", "A", "F") # returns an int print(replaced) # 2 print(df) # _| name age active group # 0| Bill 34 True F # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True F # 5| Simon 21 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int replaced = df.replace("group", "A", 'F'); System.out.println(replaced); // 2 System.out.println(df); // _| name age active group // 0| Bill 34 true F // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true F // 5| Simon 21 false B
As shown in the above example, the replace() method returns the number of replaced values. Consequently, if nothing matches the given condition, the replace() method returns 0 which indicates that nothing in the DataFrame was changed.
Be aware that the type of the replacement value must be equal to the element type of the underlying Column.
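For instance, a short sketch of a valid constant replacement on the integer-typed age Column from the examples above; the exact behaviour on a type mismatch, e.g. raising a DataFrameException, is an assumption here:

# the replacement 30 matches the element type of the age Column
replaced = df.replace("age", "2\\d", 30)
print(replaced) # 3
# passing a value of a different type, e.g. the string "thirty",
# would be rejected (assumed to raise a DataFrameException)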
Replacement Function
As an alternative to a plain constant value, you can also use a replacement function. The easiest way to do that is by using lambda expressions.
The following example demonstrates how to use lambda expressions as a replacement function:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B # increase everyone's age by 1 replaced = df.replace("age", replacement=lambda v: v + 1) print(replaced) # 6 # increase the age of people in their twenties by 3 replaced = df.replace("age", "2\\d", lambda v: v + 3) print(replaced) # 2 print(df) # _| name age active group # 0| Bill 35 True A # 1| Bob 37 False B # 2| Mark 29 True C # 3| Sofia 32 True B # 4| Paul 30 True A # 5| Simon 25 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B // increase everyone's age by 1 int replaced = df.replace("age", (Short v) -> (short)(v + 1)); System.out.println(replaced); // 6 // increase the age of people in their twenties by 3 replaced = df.replace("age", "2\\d", (Short v) -> (short)(v + 3)); System.out.println(replaced); // 2 System.out.println(df); // _| name age active group // 0| Bill 35 true A // 1| Bob 37 false B // 2| Mark 29 true C // 3| Sofia 32 true B // 4| Paul 30 true A // 5| Simon 25 false B
As shown in the above example, the type of the value returned by the replacement function must be equal to the element type of the underlying Column, e.g. if the age Column is modeled as a ShortColumn, then the returned value must be a valid short value.
Optionally, if you need to know the row index of each value you are trying to replace, you can simply adjust the lambda expression to include the row index as an int.
For example, the following code will only increase the age of a person if they are in their twenties and in group B:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B r = df.replace("age", "2\\d", lambda i, v: v + 1 if df.get_char("group", i) == "B" else v) print(r) # 1 print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 22 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int r = df.replace("age", "2\\d", (int i, Short v) -> df.getChar("group", i) == 'B' ? (short)(v + 1) : v); System.out.println(r); // 1 System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 22 false B
The above example shows that if a Column value should not be changed (because it does not meet a specific condition), then you can simply return the original Column value (i.e. the replacement function parameter).
Of course, you can also move the age condition into the lambda expression if you want.
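A rough sketch of that variant, assuming the two-parameter form of the replacement function is also accepted when no regex condition is passed:

# the age condition is now checked inside the lambda itself
r = df.replace("age", replacement=lambda i, v:
               v + 1 if 20 <= v <= 29 and df.get_char("group", i) == "B" else v)
print(r) # 1, only Simon's age actually changes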
Alternatively, for more complex code, you could also define the replacement function somewhere else and then pass it to the replace method as an argument.
The above example could therefore be rewritten as:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B def my_fn(index, value): if df.get_char("group", index) == "B": return value + 1 else: return value df.replace("age", "2\\d", my_fn) print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 22 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B IndexedValueReplacement<Short> myFn = new IndexedValueReplacement<Short>(){ @Override public Short replace(int index, Short value){ if(df.getChar("group", index) == 'B'){ return (short)(value + 1); }else{ return value; } } }; df.replace("age", "2\\d", myFn); System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 22 false B
Replace Columns
Another way to replace values is to replace the entire Column with a Column from another DataFrame. Columns can be matched both by index and by name. Columns that cannot be matched are simply ignored.
The following example shows how to replace Columns by Columns from another DataFrame:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B print(df2) # _| level active group # 0| 9 False C # 1| 8 True C # 2| 7 False B # 3| 6 False A # 4| 7 True B # 5| 5 True B replaced = df.replace(df=df2) # returns an int print(replaced) # 2 print(df) # _| name age active group # 0| Bill 34 False C # 1| Bob 36 True C # 2| Mark 25 False B # 3| Sofia 31 False A # 4| Paul 29 True B # 5| Simon 21 True B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B System.out.println(df2); // _| level active group // 0| 9 false C // 1| 8 true C // 2| 7 false B // 3| 6 false A // 4| 7 true B // 5| 5 true B int replaced = df.replace(df2); System.out.println(replaced); // 2 System.out.println(df); // _| name age active group // 0| Bill 34 false C // 1| Bob 36 true C // 2| Mark 25 false B // 3| Sofia 31 false A // 4| Paul 29 true B // 5| Simon 21 true B
The above example shows that the level Column in the DataFrame argument passed to the replace() method is ignored in the operation. The int value returned by the replace() method indicates the number of replaced Columns.
Replace Categories with Factors
StringColumns often don't hold entirely unique data points but rather recurring designations for a particular attribute. Such data points are called categories. Categories are more convenient for humans to read, but it is not easy to do numerical computations with them when they are represented as strings. Categories can be replaced by so-called factors. A factor is simply a numerical representation of a category. Therefore, replacing categories by their factors unambiguously maps every category to a unique number. This process is a simple way of encoding non-integer values as integer numbers.
The following example shows how to replace character categories inside a CharColumn with factors:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B print(df.get_column("group").type_name()) # char cat_map = df.factor("group") # returns a dict print(cat_map) # {'A': 1, 'B': 2, 'C': 3} print(df) # _| name age active group # 0| Bill 34 True 1 # 1| Bob 36 False 2 # 2| Mark 25 True 3 # 3| Sofia 31 True 2 # 4| Paul 29 True 1 # 5| Simon 21 False 2 print(df.get_column("group").type_name()) # int
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B System.out.println(df.getColumn("group").typeName()); // char Map<Object, Integer> catMap = df.factor("group"); System.out.println(catMap); // {A=1, B=2, C=3} System.out.println(df); // _| name age active group // 0| Bill 34 true 1 // 1| Bob 36 false 2 // 2| Mark 25 true 3 // 3| Sofia 31 true 2 // 4| Paul 29 true 1 // 5| Simon 21 false 2 System.out.println(df.getColumn("group").typeName()); // int
As shown in the above example, the factor() method returns a map indicating the carried out replacement of all found categories into factors. It also shows how the group Column has been converted to an IntColumn. Please note that a DataFrame does not keep track of any category-factor-maps, so if you at a later point want to replace the factors back to their corresponding category, you must not lose the reference to the map object returned by the factor() method.
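The following sketch shows one way such a map could be used later to translate factors back into category labels; the dictionary inversion is plain Python, while the get_int() accessor for the now numeric group Column is an assumption:

# invert the category-factor map returned by factor()
factor_to_category = {factor: category for category, factor in cat_map.items()}
# look up the category label for the factor in the first row
# (get_int() is assumed to be the accessor for int values)
print(factor_to_category[df.get_int("group", 0)]) # A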
Conversion
The DataFrame API provides methods and utilities to perform various conversions. You may convert specific Columns or an entire DataFrame at once. Converting Columns changes their (element) type. Such a conversion is not in-place because Column instances are immutable with respect to their type. That is, a Column conversion creates a new Column instance of the desired type and populates it with the converted values of the original Column. The same principle applies to DataFrame conversions, which can convert between DataFrame implementations.
Convert Columns
All Columns can be converted to any other Column. The conversion of a Column may throw a DataFrameException if a column value cannot be meaningfully converted to the target type. For example, a Column conversion from a StringColumn to an IntColumn may fail if one encountered value is a non-numeric string, e.g. 'abcd'.
The following example shows the results of various Column conversions:
print(df) # _| A B C # 0| 11.1 42 Yes # 1| 22.2 43 No # 2| 33.3 44 1 print(df.info()) # Type: Default # Columns: 3 # Rows: 3 # _| column type code # 0| A double 7 # 1| B int 3 # 2| C string 5 df.convert("A", "int") df.convert("B", "double") df.convert("C", "boolean") print(df) # _| A B C # 0| 11 42.0 True # 1| 22 43.0 False # 2| 33 44.0 True print(df.info()) # Type: Default # Columns: 3 # Rows: 3 # _| column type code # 0| A int 3 # 1| B double 7 # 2| C boolean 9
System.out.println(df); // _| A B C // 0| 11.1 42 Yes // 1| 22.2 43 No // 2| 33.3 44 1 System.out.println(df.info()); // Type: Default // Columns: 3 // Rows: 3 // _| column type code // 0| A double 7 // 1| B int 3 // 2| C string 5 df.convert("A", IntColumn.TYPE_CODE); df.convert("B", DoubleColumn.TYPE_CODE); df.convert("C", BooleanColumn.TYPE_CODE); System.out.println(df); // _| A B C // 0| 11 42.0 true // 1| 22 43.0 false // 2| 33 44.0 true System.out.println(df.info()); // Type: Default // Columns: 3 // Rows: 3 // _| column type code // 0| A int 3 // 1| B double 7 // 2| C boolean 9
Convert DataFrames
A DataFrame can be converted to a different DataFrame implementation at runtime. That is, a DefaultDataFrame can be converted to a NullableDataFrame and a NullableDataFrame to a DefaultDataFrame. In the first case, the actual values in each Column are not changed. The reverse direction, however, is not lossless. Converting a NullableDataFrame to a DefaultDataFrame can result in a loss of information since all null values in each Column are converted to the corresponding default value of the underlying Column.
The following example illustrates the conversion of a NullableDataFrame to a DefaultDataFrame:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| null null null null # 4| null null null null # 5| Simon 21 False B print(df.is_nullable()) # True df = DataFrame.convert_to(df, "default") print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| n/a 0 False ? # 4| n/a 0 False ? # 5| Simon 21 False B print(df.is_nullable()) # False
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| null null null null // 4| null null null null // 5| Simon 21 false B System.out.println(df.isNullable()); // true df = DataFrame.convert(df, DefaultDataFrame.class); System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| n/a 0 false ? // 4| n/a 0 false ? // 5| Simon 21 false B System.out.println(df.isNullable()); // false
The above example shows how the values in the rows at index 3 and 4 are converted to the corresponding default value of the converted Column. It also illustrates that if the DefaultDataFrame in the above example were converted back to a NullableDataFrame, the replaced (formerly null) values would not be converted back to null values.
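A minimal sketch of that circumstance; the "nullable" type identifier passed to convert_to() is an assumption here:

# converting back to a NullableDataFrame does not restore the nulls
df = DataFrame.convert_to(df, "nullable") # "nullable" identifier is assumed
print(df.is_nullable()) # True
print(df.get_string("name", 3)) # n/a  (the default value, not null)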
Statistical Operations
The DataFrame API provides various methods to gain statistical information.
Minimum
The minimum can be computed for all numeric Columns. The minimum() method can be used in two ways. First, by only specifying the Column index or name, the method computes and returns the minimum, i.e. smallest value, in the specified Column. Second, by additionally specifying a rank, the method computes and returns a DataFrame with the specified number of elements, sorted ascendingly by the specified Column.
The following example shows how to compute the minimum:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 # returns a Python int because 'age' is not FP youngest = df.minimum("age") print(youngest) # 21 # returns a Python float because 'score' is FP lowest_score = df.minimum("score") print(lowest_score) # 0.42
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 // can be safely cast if the 'age' column is not FP int youngest = (int) df.minimum("age"); System.out.println(youngest); // 21 double lowestScore = df.minimum("score"); System.out.println(lowestScore); // 0.42
Optionally, you can pass an additional int to the minimum() method which specifies how many ranked minima you want the method to return.
The following example shows how to compute ranked minima:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 lowest_scores = df.minimum("score", 3) # returns a DataFrame print(lowest_scores) # _| name age score # 0| Sofia 31 0.42 # 1| Bill 34 0.45 # 2| Simon 21 0.57
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 DataFrame lowestScores = df.minimum("score", 3); System.out.println(lowestScores); // _| name age score // 0| Sofia 31 0.42 // 1| Bill 34 0.45 // 2| Simon 21 0.57
In the above example, the minimum() method returns the 3 rows with the lowest score, sorted in ascending order. Please note that this is the preferred way of computing ranked minima, as opposed to simply sorting the entire DataFrame by the specified Column and selecting the first few rows.
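For comparison, the discouraged alternative would be a sketch along these lines, which yields the same three rows but sorts the entire DataFrame first:

# less efficient: sort a copy of the whole DataFrame and take the first rows
lowest_scores = df.clone()
lowest_scores.sort_ascending_by("score")
lowest_scores = lowest_scores.head(3)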
Maximum
The maximum can be computed for all numeric Columns. The maximum() method can be used in two ways. First, by only specifying the Column index or name, the method computes and returns the maximum, i.e. largest value, in the specified Column. Second, by additionally specifying a rank, the method computes and returns a DataFrame with the specified number of elements, sorted descendingly by the specified Column.
The following example shows how to compute the maximum:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 # returns a Python int because 'age' is not FP oldest = df.maximum("age") print(oldest) # 36 # returns a Python float because 'score' is FP highest_score = df.maximum("score") print(highest_score) # 0.89
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 // can be safely cast if the 'age' column is not FP int oldest = (int) df.maximum("age"); System.out.println(oldest); // 36 double highestScore = df.maximum("score"); System.out.println(highestScore); // 0.89
Optionally, you can pass an additional int to the maximum() method which specifies how many ranked maxima you want the method to return.
The following example shows how to compute ranked maxima:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 highest_scores = df.maximum("score", 3) # returns a DataFrame print(highest_scores) # _| name age score # 0| Paul 29 0.89 # 1| Mark 25 0.78 # 2| Bob 36 0.62
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 DataFrame highestScores = df.maximum("score", 3); System.out.println(highestScores); // _| name age score // 0| Paul 29 0.89 // 1| Mark 25 0.78 // 2| Bob 36 0.62
In the above example, the maximum() method returns the 3 rows with the highest score, sorted in descending order. Please note that this is the preferred way of computing ranked maxima, as opposed to simply sorting the entire DataFrame by the specified Column and selecting the first few rows.
Average
The average can be computed for all numeric Columns.
The following example shows how to compute the average:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 average_age = df.average("age") # returns a Python float print(average_age) # 29.333333333333332
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 double averageAge = df.average("age"); System.out.println(averageAge); // 29.333333333333332
Median
The median can be computed for all numeric Columns.
The following example shows how to compute the median:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 median_age = df.median("age") # returns a Python float print(median_age) # 30.0
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 double medianAge = df.median("age"); System.out.println(medianAge); // 30.0
Sum
The sum can be computed for all numeric Columns.
The following example shows how to compute the sum:
print(df) # _| name age score # 0| Bill 34 0.45 # 1| Bob 36 0.62 # 2| Mark 25 0.78 # 3| Sofia 31 0.42 # 4| Paul 29 0.89 # 5| Simon 21 0.57 sum_scores = df.sum("score") print(sum_scores) # 3.73
System.out.println(df); // _| name age score // 0| Bill 34 0.45 // 1| Bob 36 0.62 // 2| Mark 25 0.78 // 3| Sofia 31 0.42 // 4| Paul 29 0.89 // 5| Simon 21 0.57 double sumScores = df.sum("score"); System.out.println(sumScores); // 3.73
Count
The occurrence of values can be counted. There are two ways that the count() method can be used. First, you can count the number of occurrences of all unique elements in a specific Column. The occurrences are modeled as a DataFrame. Second, you can specify an additional regular expression which defines the elements to count.
The following example shows how to count all values in a specific Column:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B groups = df.count("group") # returns a DataFrame print(groups) # _| group count % # 0| A 2 0.3333333432674408 # 1| B 3 0.5 # 2| C 1 0.1666666716337204
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame groups = df.count("group"); System.out.println(groups); // _| group count % // 0| A 2 0.33333334 // 1| B 3 0.5 // 2| C 1 0.16666667
You can pass an additional argument to the count() method to specify a regular expression; only the elements in the specified Column that match it are counted.
The following example shows how to count specific elements:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B count_B = df.count("group", "B") # returns an int print(count_B) # 3
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int countB = df.count("group", "B"); System.out.println(countB); // 3
Count Unique Values
You can directly count the number of unique values in a specific Column.
The following example shows how to count all unique values in a Column:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B val = df.count_unique("group") # returns an int print(val) # 3
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B int val = df.countUnique("group"); System.out.println(val); // 3
Unique Values
You can compute a set of all unique values in a specific Column.
The following example shows how to create a set holding all unique values inside a Column:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B groups = df.unique("group") # returns a Python set print(groups) # {'A', 'B', 'C'}
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B Set<Character> groups = df.unique("group"); System.out.println(groups); // [A, B, C]
Numerical Operations
The DataFrame API provides various methods to manipulate values in numerical Columns. These methods are bulk operations, which means that they apply to the entire Column.
Absolute
Changing numerical values to their absolute value essentially ensures that all values have a positive sign. The magnitude of each value, however, is not changed by this operation.
The following example shows how to use the absolute() method:
print(df) # _| name age active level # 0| Bill 34 True -42 # 1| Bob 36 False 12 # 2| Mark 25 True 56 # 3| Sofia 31 True -13 # 4| Paul 29 True -51 # 5| Simon 21 False -46 df.absolute("level") print(df) # _| name age active level # 0| Bill 34 True 42 # 1| Bob 36 False 12 # 2| Mark 25 True 56 # 3| Sofia 31 True 13 # 4| Paul 29 True 51 # 5| Simon 21 False 46
System.out.println(df); // _| name age active level // 0| Bill 34 true -42 // 1| Bob 36 false 12 // 2| Mark 25 true 56 // 3| Sofia 31 true -13 // 4| Paul 29 true -51 // 5| Simon 21 false -46 df.absolute("level"); System.out.println(df); // _| name age active level // 0| Bill 34 true 42 // 1| Bob 36 false 12 // 2| Mark 25 true 56 // 3| Sofia 31 true 13 // 4| Paul 29 true 51 // 5| Simon 21 false 46
Ceil
Ceiling numerical values replaces all values by the value returned by the mathematical ceil function.
The following example shows how to use the ceil() method:
print(df) # _| name age active level # 0| Bill 34 True -42.4 # 1| Bob 36 False 12.5 # 2| Mark 25 True 56.87 # 3| Sofia 31 True -13.1 # 4| Paul 29 True 51.9 # 5| Simon 21 False 46.01 df.ceil("level") print(df) # _| name age active level # 0| Bill 34 True -42.0 # 1| Bob 36 False 13.0 # 2| Mark 25 True 57.0 # 3| Sofia 31 True -13.0 # 4| Paul 29 True 52.0 # 5| Simon 21 False 47.0
System.out.println(df); // _| name age active level // 0| Bill 34 true -42.4 // 1| Bob 36 false 12.5 // 2| Mark 25 true 56.87 // 3| Sofia 31 true -13.1 // 4| Paul 29 true 51.9 // 5| Simon 21 false 46.01 df.ceil("level"); System.out.println(df); // _| name age active level // 0| Bill 34 true -42.0 // 1| Bob 36 false 13.0 // 2| Mark 25 true 57.0 // 3| Sofia 31 true -13.0 // 4| Paul 29 true 52.0 // 5| Simon 21 false 47.0
After ceiling values in a numerical Column you could convert that Column into, e.g. an IntColumn, without losing information.
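For example, a short sketch combining ceil() with the convert() method from the Conversion section:

# after ceiling, every level value is a whole number,
# so the Column can be converted to int without loss
df.ceil("level")
df.convert("level", "int")
print(df.get_column("level").type_name()) # int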
Floor
Flooring numerical values replaces all values by the value returned by the mathematical floor function.
The following example shows how to use the floor() method:
print(df) # _| name age active level # 0| Bill 34 True -42.4 # 1| Bob 36 False 12.5 # 2| Mark 25 True 56.87 # 3| Sofia 31 True -13.1 # 4| Paul 29 True 51.9 # 5| Simon 21 False 46.01 df.floor("level") print(df) # _| name age active level # 0| Bill 34 True -43.0 # 1| Bob 36 False 12.0 # 2| Mark 25 True 56.0 # 3| Sofia 31 True -14.0 # 4| Paul 29 True 51.0 # 5| Simon 21 False 46.0
System.out.println(df); // _| name age active level // 0| Bill 34 true -42.4 // 1| Bob 36 false 12.5 // 2| Mark 25 true 56.87 // 3| Sofia 31 true -13.1 // 4| Paul 29 true 51.9 // 5| Simon 21 false 46.01 df.floor("level"); System.out.println(df); // _| name age active level // 0| Bill 34 true -43.0 // 1| Bob 36 false 12.0 // 2| Mark 25 true 56.0 // 3| Sofia 31 true -14.0 // 4| Paul 29 true 51.0 // 5| Simon 21 false 46.0
After flooring values in a numerical Column you could convert that Column into, e.g. an IntColumn, without losing information.
Round
Rounding numerical values replaces all values in a Column by the corresponding rounded value. You must specify the number of decimal places to round to.
The following example shows how to use the round() method:
print(df) # _| name age active level # 0| Bill 34 True -42.459 # 1| Bob 36 False 12.525 # 2| Mark 25 True 56.879 # 3| Sofia 31 True -13.148 # 4| Paul 29 True 51.999 # 5| Simon 21 False 46.452 df.round("level", 1) print(df) # _| name age active level # 0| Bill 34 True -42.5 # 1| Bob 36 False 12.5 # 2| Mark 25 True 56.9 # 3| Sofia 31 True -13.1 # 4| Paul 29 True 52.0 # 5| Simon 21 False 46.5
System.out.println(df); // _| name age active level // 0| Bill 34 true -42.459 // 1| Bob 36 false 12.525 // 2| Mark 25 true 56.879 // 3| Sofia 31 true -13.148 // 4| Paul 29 true 51.999 // 5| Simon 21 false 46.452 df.round("level", 1); System.out.println(df); // _| name age active level // 0| Bill 34 true -42.5 // 1| Bob 36 false 12.5 // 2| Mark 25 true 56.9 // 3| Sofia 31 true -13.1 // 4| Paul 29 true 52.0 // 5| Simon 21 false 46.5
Clip
Clipping numerical values ensures that all values in a Column are in a specified range by cutting off all values that lie outside of that range. The range can be open on either side, i.e. you don't necessarily have to specify both lower and upper clip boundaries.
The following example shows how to use the clip() method:
print(df) # _| name age active level # 0| Bill 34 True -42.49 # 1| Bob 36 False 12.52 # 2| Mark 25 True 56.87 # 3| Sofia 31 True -13.14 # 4| Paul 29 True 51.999 # 5| Simon 21 False 26.4 df.clip("level", -20, 30) print(df) # _| name age active level # 0| Bill 34 True -20.0 # 1| Bob 36 False 12.52 # 2| Mark 25 True 30.0 # 3| Sofia 31 True -13.14 # 4| Paul 29 True 30.0 # 5| Simon 21 False 26.4
System.out.println(df); // _| name age active level // 0| Bill 34 true -42.49 // 1| Bob 36 false 12.52 // 2| Mark 25 true 56.87 // 3| Sofia 31 true -13.14 // 4| Paul 29 true 51.999 // 5| Simon 21 false 26.4 df.clip("level", -20, 30); System.out.println(df); // _| name age active level // 0| Bill 34 true -20.0 // 1| Bob 36 false 12.52 // 2| Mark 25 true 30.0 // 3| Sofia 31 true -13.14 // 4| Paul 29 true 30.0 // 5| Simon 21 false 26.4
Sort Operations
The DataFrame API provides methods to sort all rows of a DataFrame based on the values in a specific Column.
The following example shows how to sort a DataFrame in ascending order:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B df.sort_by("age") # or alternatively: # df.sort_ascending_by("age") print(df) # _| name age active group # 0| Simon 21 False B # 1| Mark 25 True C # 2| Paul 29 True A # 3| Sofia 31 True B # 4| Bill 34 True A # 5| Bob 36 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B df.sortBy("age"); // or alternatively: // df.sortAscendingBy("age"); System.out.println(df); // _| name age active group // 0| Simon 21 false B // 1| Mark 25 true C // 2| Paul 29 true A // 3| Sofia 31 true B // 4| Bill 34 true A // 5| Bob 36 false B
In principle, you can sort by any Column. All values are sorted according to their natural order. For strings and chars this means that they are sorted lexicographically. Please note that values in BinaryColumns are sorted according to their length, i.e. the number of bytes in the byte array object.
You may also sort all rows in a DataFrame in descending order according to values in a specific Column.
The following example shows how to sort a DataFrame in descending order:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B df.sort_descending_by("age") print(df) # _| name age active group # 0| Bob 36 False B # 1| Bill 34 True A # 2| Sofia 31 True B # 3| Paul 29 True A # 4| Mark 25 True C # 5| Simon 21 False B
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B df.sortDescendingBy("age"); System.out.println(df); // _| name age active group // 0| Bob 36 false B // 1| Bill 34 true A // 2| Sofia 31 true B // 3| Paul 29 true A // 4| Mark 25 true C // 5| Simon 21 false B
Regardless of whether a DataFrame is sorted in ascending or descending order, when using a NullableDataFrame all null values are moved to the end of the underlying DataFrame. When sorting values with respect to a Column containing float or double values, then any NaN values are moved to the end of the DataFrame but before any null values.
The following example illustrates the sort behaviour for a Column holding double values in a DataFrame containing null values:
print(df) # _| name age active level # 0| Bill 34 True null # 1| Bob 36 False NaN # 2| Mark 25 True 12.3 # 3| Sofia 31 True null # 4| Paul 29 True 5.2 # 5| Simon 21 False NaN df.sort_by("level") print(df) # _| name age active level # 0| Paul 29 True 5.2 # 1| Mark 25 True 12.3 # 2| Bob 36 False NaN # 3| Simon 21 False NaN # 4| Sofia 31 True null # 5| Bill 34 True null
System.out.println(df); // _| name age active level // 0| Bill 34 true null // 1| Bob 36 false NaN // 2| Mark 25 true 12.3 // 3| Sofia 31 true null // 4| Paul 29 true 5.2 // 5| Simon 21 false NaN df.sortBy("level"); System.out.println(df); // _| name age active level // 0| Paul 29 true 5.2 // 1| Mark 25 true 12.3 // 2| Bob 36 false NaN // 3| Simon 21 false NaN // 4| Sofia 31 true null // 5| Bill 34 true null
Please note that the order between equal elements is not defined. Particularly, the sorting algorithm does not have to be stable.
Set Operations
The DataFrame API provides various methods for set-theoretic operations. These operations can be performed with respect to either columns or rows. Three basic set-theoretic operations are supported: difference, union and intersection. All of these operations are performed by treating either the columns or the rows of two separate DataFrames as two sets.
Warning:
Set-theoretic operations on Columns only copy the references to the corresponding Column instances. Changing the row structure of the resulting DataFrames can lead to an incoherent DataFrame state. Always copy the DataFrame returned by set-theoretic column operations when subsequently changing the row structure!
Difference Columns
This operation computes the set-theoretic difference with regard to all columns in two distinct DataFrame instances.
The following example shows how to compute the column difference of two DataFrames:
print(df) # _| name age active level # 0| Bill 34 True 42.49 # 1| Bob 36 False 12.52 # 2| Mark 25 True 56.87 # 3| Sofia 31 True 13.14 # 4| Paul 29 True 51.999 # 5| Simon 21 False 26.4 print(df2) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B diff = df.difference_columns(df2) # returns a DataFrame print(diff) # _| level group # 0| 42.49 A # 1| 12.52 B # 2| 56.87 C # 3| 13.14 B # 4| 51.999 A # 5| 26.4 B
System.out.println(df); // _| name age active level // 0| Bill 34 true 42.49 // 1| Bob 36 false 12.52 // 2| Mark 25 true 56.87 // 3| Sofia 31 true 13.14 // 4| Paul 29 true 51.999 // 5| Simon 21 false 26.4 System.out.println(df2); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame diff = df.differenceColumns(df2); System.out.println(diff); // _| level group // 0| 42.49 A // 1| 12.52 B // 2| 56.87 C // 3| 13.14 B // 4| 51.999 A // 5| 26.4 B
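As emphasized in the warning above, the returned DataFrame shares its Column instances with the operands. A minimal sketch of the safe pattern is to copy the result before changing its row structure:

# clone the result so that subsequent row changes cannot affect
# the Column instances still referenced by df and df2
diff = df.difference_columns(df2).clone()
diff.add_row([99.99, "D"]) # safe: operates on the copied Columns only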
Union Columns
This operation computes the set-theoretic union with regard to all columns in two distinct DataFrame instances. The DataFrame returned by this operation contains references to all Columns of the DataFrame that the method is called upon and of the specified DataFrame, omitting any duplicate Columns from the argument DataFrame.
The following example shows how to compute the column union of two DataFrames:
print(df) # _| name age active level # 0| Bill 34 True 42.49 # 1| Bob 36 True 12.52 # 2| Mark 25 True 56.87 # 3| Sofia 31 True 13.14 # 4| Paul 29 True 51.999 # 5| Simon 21 True 26.4 print(df2) # _| name active group # 0| Bill False A # 1| Bob False B # 2| Mark False C # 3| Sofia False B # 4| Paul False A # 5| Simon False B union = df.union_columns(df2) # returns a DataFrame print(union) # _| name age active level group # 0| Bill 34 True 42.49 A # 1| Bob 36 True 12.52 B # 2| Mark 25 True 56.87 C # 3| Sofia 31 True 13.14 B # 4| Paul 29 True 51.999 A # 5| Simon 21 True 26.4 B
System.out.println(df); // _| name age active level // 0| Bill 34 true 42.49 // 1| Bob 36 true 12.52 // 2| Mark 25 true 56.87 // 3| Sofia 31 true 13.14 // 4| Paul 29 true 51.999 // 5| Simon 21 true 26.4 System.out.println(df2); // _| name active group // 0| Bill false A // 1| Bob false B // 2| Mark false C // 3| Sofia false B // 4| Paul false A // 5| Simon false B DataFrame union = df.unionColumns(df2); System.out.println(union); // _| name age active level group // 0| Bill 34 true 42.49 A // 1| Bob 36 true 12.52 B // 2| Mark 25 true 56.87 C // 3| Sofia 31 true 13.14 B // 4| Paul 29 true 51.999 A // 5| Simon 21 true 26.4 B
Intersection Columns
This operation computes the set-theoretic intersection with regard to all columns in two distinct DataFrame instances. The DataFrame returned by this operation contains references to all Columns of the DataFrame that the method is called upon which also occur in the specified DataFrame.
The following example shows how to compute the column intersection of two DataFrames:
print(df) # _| name age active level # 0| Bill 34 True 42.49 # 1| Bob 36 True 12.52 # 2| Mark 25 True 56.87 # 3| Sofia 31 True 13.14 # 4| Paul 29 True 51.999 # 5| Simon 21 True 26.4 print(df2) # _| name active group # 0| Bill False A # 1| Bob False B # 2| Mark False C # 3| Sofia False B # 4| Paul False A # 5| Simon False B intersec = df.intersection_columns(df2) # returns a DataFrame print(intersec) # _| name active # 0| Bill True # 1| Bob True # 2| Mark True # 3| Sofia True # 4| Paul True # 5| Simon True
System.out.println(df); // _| name age active level // 0| Bill 34 true 42.49 // 1| Bob 36 true 12.52 // 2| Mark 25 true 56.87 // 3| Sofia 31 true 13.14 // 4| Paul 29 true 51.999 // 5| Simon 21 true 26.4 System.out.println(df2); // _| name active group // 0| Bill false A // 1| Bob false B // 2| Mark false C // 3| Sofia false B // 4| Paul false A // 5| Simon false B DataFrame intersec = df.intersectionColumns(df2); System.out.println(intersec); // _| name active // 0| Bill true // 1| Bob true // 2| Mark true // 3| Sofia true // 4| Paul true // 5| Simon true
Difference Rows
This operation computes the set-theoretic difference with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row difference of two DataFrames:
print(df) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| aac 33 True # 3| aad 44 False # 4| aae 55 True print(df2) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| aac 33 True # 3| aad 44 False # 4| ccc 22 False # 5| ccc 33 True diff = df.difference_rows(df2) # returns a DataFrame print(diff) # _| A B C # 0| aae 55 True # 1| ccc 22 False # 2| ccc 33 True
System.out.println(df); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| aac 33 true // 3| aad 44 false // 4| aae 55 true System.out.println(df2); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| aac 33 true // 3| aad 44 false // 4| ccc 22 false // 5| ccc 33 true DataFrame diff = df.differenceRows(df2); System.out.println(diff); // _| A B C // 0| aae 55 true // 1| ccc 22 false // 2| ccc 33 true
Union Rows
This operation computes the set-theoretic union with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row union of two DataFrames:
print(df) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| aac 33 True print(df2) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| ccc 33 False union = df.union_rows(df2) # returns a DataFrame print(union) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| aac 33 True # 3| ccc 33 False
System.out.println(df); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| aac 33 true System.out.println(df2); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| ccc 33 false DataFrame union = df.unionRows(df2); System.out.println(union); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| aac 33 true // 3| ccc 33 false
Intersection Rows
This operation computes the set-theoretic intersection with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row intersection of two DataFrames:
print(df) # _| A B C # 0| aaa 11 True # 1| aab 22 False # 2| aac 33 True print(df2) # _| A B C # 0| ccc 33 False # 1| aaa 11 True # 2| aab 22 False intersec = df.intersection_rows(df2) # returns a DataFrame print(intersec) # _| A B C # 0| aaa 11 True # 1| aab 22 False
System.out.println(df); // _| A B C // 0| aaa 11 true // 1| aab 22 false // 2| aac 33 true System.out.println(df2); // _| A B C // 0| ccc 33 false // 1| aaa 11 true // 2| aab 22 false DataFrame intersec = df.intersectionRows(df2); System.out.println(intersec); // _| A B C // 0| aaa 11 true // 1| aab 22 false
Group Operations
The DataFrame API provides various methods to group elements together and aggregate values from other numerical Columns by applying a statistical operation.
Minimum
This operation groups values from the specified Column and computes the minimum of all numerical Columns for each group.
The following example shows how to group minimum values:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B minima = df.group_minimum_by("group") # returns a DataFrame print(minima) # _| group age # 0| A 29 # 1| B 21 # 2| C 25
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame minima = df.groupMinimumBy("group"); System.out.println(minima); // _| group age // 0| A 29 // 1| B 21 // 2| C 25
Maximum
This operation groups values from the specified Column and computes the maximum of all numerical Columns for each group.
The following example shows how to group maximum values:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B maxima = df.group_maximum_by("group") # returns a DataFrame print(maxima) # _| group age # 0| A 34 # 1| B 36 # 2| C 25
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame maxima = df.groupMaximumBy("group"); System.out.println(maxima); // _| group age // 0| A 34 // 1| B 36 // 2| C 25
Average
This operation groups values from the specified Column and computes the average of all numerical Columns for each group.
The following example shows how to group average values:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B averages = df.group_average_by("group") # returns a DataFrame print(averages) # _| group age # 0| A 31.5 # 1| B 29.333333333333332 # 2| C 25.0
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame averages = df.groupAverageBy("group"); System.out.println(averages); // _| group age // 0| A 31.5 // 1| B 29.333333333333332 // 2| C 25.0
Sum
This operation groups values from the specified Column and computes the sum of all numerical Columns for each group.
The following example shows how to group sum values:
print(df) # _| name age active group # 0| Bill 34 True A # 1| Bob 36 False B # 2| Mark 25 True C # 3| Sofia 31 True B # 4| Paul 29 True A # 5| Simon 21 False B sums = df.group_sum_by("group") # returns a DataFrame print(sums) # _| group age # 0| A 63.0 # 1| B 88.0 # 2| C 25.0
System.out.println(df); // _| name age active group // 0| Bill 34 true A // 1| Bob 36 false B // 2| Mark 25 true C // 3| Sofia 31 true B // 4| Paul 29 true A // 5| Simon 21 false B DataFrame sums = df.groupSumBy("group"); System.out.println(sums); // _| group age // 0| A 63.0 // 1| B 88.0 // 2| C 25.0
Join Operations
The DataFrame API provides a method to perform joins on DataFrame instances. This is essentially equivalent to an SQL inner-join operation. Therefore, even though the join() method is called on a DataFrame instance, the operation itself is commutative.
The following example shows how to use the join() method when specifying both column names:
print(df) # _| id name age # 0| 101 Bill 34 # 1| 102 Bob 36 # 2| 103 Mark 25 # 3| 104 Paul 29 print(df2) # _| key active group # 0| 101 True A # 1| 102 False B # 2| 103 True C result = df.join(df2, "id", "key") # returns a DataFrame print(result) # _| id name age active group # 0| 101 Bill 34 True A # 1| 102 Bob 36 False B # 2| 103 Mark 25 True C
System.out.println(df); // _| id name age // 0| 101 Bill 34 // 1| 102 Bob 36 // 2| 103 Mark 25 // 3| 104 Paul 29 System.out.println(df2); // _| key active group // 0| 101 true A // 1| 102 false B // 2| 103 true C DataFrame result = df.join(df2, "id", "key"); System.out.println(result); // _| id name age active group // 0| 101 Bill 34 true A // 1| 102 Bob 36 false B // 2| 103 Mark 25 true C
Optionally, when both DataFrames involved in a join operation have exactly one Column with an identical name in common, you can omit the specification of the join keys when calling the join() method.
The following example illustrates this:
print(df) # _| id name age # 0| 101 Bill 34 # 1| 102 Bob 36 # 2| 103 Mark 25 # 3| 104 Paul 29 print(df2) # _| id active group # 0| 101 True A # 1| 102 False B # 2| 103 True C result = df.join(df2) print(result) # _| id name age active group # 0| 101 Bill 34 True A # 1| 102 Bob 36 False B # 2| 103 Mark 25 True C
System.out.println(df); // _| id name age // 0| 101 Bill 34 // 1| 102 Bob 36 // 2| 103 Mark 25 // 3| 104 Paul 29 System.out.println(df2); // _| id active group // 0| 101 true A // 1| 102 false B // 2| 103 true C DataFrame result = df.join(df2); System.out.println(result); // _| id name age active group // 0| 101 Bill 34 true A // 1| 102 Bob 36 false B // 2| 103 Mark 25 true C
Utilities
The DataFrame API provides a collection of standard methods for various purposes. All methods that do not directly fit into one of the already described sections are explained in the following subsections.
Info
The info() method provides a descriptive string summarizing the main properties of a DataFrame without showing the actual data. The actual string returned by the method is not strictly defined by the specification and may therefore be implementation dependent.
The following example shows what the info string looks like:
print(df) # _| id name age active group # 0| 101 Bill 34 True A # 1| 102 Bob 36 False B # 2| 103 Mark 25 True C # 3| 104 Sofia 31 True B print(df.info()) # Type: Default # Columns: 5 # Rows: 4 # _| column type code # 0| id int 3 # 1| name string 5 # 2| age short 2 # 3| active boolean 9 # 4| group char 8
System.out.println(df); // _| id name age active group // 0| 101 Bill 34 true A // 1| 102 Bob 36 false B // 2| 103 Mark 25 true C // 3| 104 Sofia 31 true B System.out.println(df.info()); // Type: Default // Columns: 5 // Rows: 4 // _| column type code // 0| id int 3 // 1| name string 5 // 2| age short 2 // 3| active boolean 9 // 4| group char 8
The code column in the above info-string DataFrame refers to the unique type code of the corresponding Column instance, e.g. a (non-nullable) StringColumn has a unique type code of 5.
To Array
A DataFrame can be converted to a plain two-dimensional array/list. All values are copies of the values inside the underlying DataFrame, with the exception of byte arrays of BinaryColumns.
The following code gives an example of how to get a DataFrame as an array/list of objects:
print(df) # _| id name active # 0| 101 Bill True # 1| 102 Bob False # 2| 103 Mark True array = df.to_array() # returns a list of lists print(array) # [[101, 102, 103], ['Bill', 'Bob', 'Mark'], [True, False, True]]
System.out.println(df); // _| id name active // 0| 101 Bill true // 1| 102 Bob false // 2| 103 Mark true Object[] array = df.toArray(); System.out.println(Arrays.deepToString(array)); // [[101, 102, 103], [Bill, Bob, Mark], [true, false, true]]
To String
A DataFrame can be represented as a string. This can be very helpful when working with DataFrames interactively or when debugging.
The following example shows a string representation of a DataFrame:
print(df) # _| id name active # 0| 101 Bill True # 1| 102 Bob False # 2| 103 Mark True string = df.to_string() # returns a Python str print(string) # _| id name active # 0| 101 Bill True # 1| 102 Bob False # 2| 103 Mark True
System.out.println(df); // _| id name active // 0| 101 Bill true // 1| 102 Bob false // 2| 103 Mark true String string = df.toString(); System.out.println(string); // _| id name active // 0| 101 Bill true // 1| 102 Bob false // 2| 103 Mark true
Clone
A DataFrame can be cloned (copied) in its entirety. This will create a deep copy, i.e. all values including byte arrays of BinaryColumns are copied to a new DataFrame instance.
The following example shows how to copy a DataFrame:
print(df) # _| id name active # 0| 101 Bill True # 1| 102 Bob False # 2| 103 Mark True copy = df.clone() # or alternatively: # copy = DataFrame.copy(df) print(copy) # _| id name active # 0| 101 Bill True # 1| 102 Bob False # 2| 103 Mark True
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

DataFrame copy = df.clone();
// or alternatively:
// DataFrame copy = DataFrame.copy(df);
System.out.println(copy);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
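Since clone() produces a deep copy, modifications of the copy do not affect the original DataFrame. A minimal sketch in Python, reusing the df from above together with the set_boolean() setter and the equals() method described further below:

copy = df.clone()
# Change a value in the copy only; the original remains untouched
copy.set_boolean("active", 1, True)
print(df.equals(copy))
# False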
Hash Code
A hash code can be generated from any DataFrame. Please note that hash code values are not required to be platform independent. Even two identical DataFrames in separate system processes do not necessarily have the same hash code. The method for computing the hash code of a DataFrame simply returns an integer value with at least 32 bits worth of information. Therefore, hash code values are not sufficiently resistant to hash collisions.
Warning:
Do not use hash codes to determine whether two DataFrames are equal. Use the equals() method instead!
The following example shows how to compute a hash code value:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

hashcode = df.hash_code()  # returns an int
# or alternatively:
# hashcode = hash(df)
print(hashcode)
# 1486664104986588480
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

int hashcode = df.hashCode();
System.out.println(hashcode);
// -843453324
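To make the warning above concrete, the following minimal sketch in Python (reusing the df from above) shows the recommended pattern: equal DataFrames produce equal hash codes within the same process, but a matching hash code alone never proves equality, so equals() must be used for the actual check:

df2 = df.clone()
# Equal hash codes within one process...
print(df.hash_code() == df2.hash_code())
# True
# ...but equality must always be confirmed with equals()
print(df.equals(df2))
# True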
Equals
Whether two DataFrames are equal can be determined with the equals() method. Two DataFrames are equal if they have the same column structure and all corresponding row elements are equal. The order of Columns in both DataFrames is taken into account when checking for equality.
The following example shows how to check if two DataFrames are equal:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

df2 = df.clone()
is_equal = df.equals(df2)  # returns a Python bool
# or alternatively:
# is_equal = df == df2
print(is_equal)
# True

df2.set_boolean("active", 1, True)
is_equal = df.equals(df2)
print(is_equal)
# False
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

DataFrame df2 = df.clone();
boolean isEqual = df.equals(df2);
System.out.println(isEqual);
// true

df2.setBoolean("active", 1, true);
isEqual = df.equals(df2);
System.out.println(isEqual);
// false
Memory Usage
The memory usage of a DataFrame can be approximately determined at runtime. Please note that the value returned by the method is only an approximation, comparable to the size of the payload data in serialized form. The memory usage is denoted in bytes; the actual memory footprint might be higher than the value returned by the method.
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

val = df.memory_usage()  # returns an int
print(val)
# 39
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

int val = df.memoryUsage();
System.out.println(val);
// 27
Flush
A flush operation shrinks the internally used array of each Column to match the actual size of the DataFrame. This can be used in situations where unnecessarily allocated space should be freed to reduce the overall memory footprint of a process.
The following example shows how to flush a DataFrame:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False

print(df.capacity())
# 2

df.add_row([103, "Mark", True])
print(df.capacity())
# 4

df.flush()
print(df.capacity())
# 3
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false

System.out.println(df.capacity());
// 2

df.addRow(103, "Mark", true);
System.out.println(df.capacity());
// 4

df.flush();
System.out.println(df.capacity());
// 3
Merge
Multiple DataFrames can be merged into one DataFrame. Merging is performed with respect to all Columns. If DataFrames have duplicate Column names then the offending Columns are automatically renamed.
The following example demonstrates how to merge DataFrames:
print(df)
# _| id name
# 0| 101 Bill
# 1| 102 Bob

print(df2)
# _| name active
# 0| Smith True
# 1| Swanson False

print(df3)
# _| age level
# 0| 34 1.5
# 1| 36 2.3

res = DataFrame.merge(df, df2, df3)  # returns a DataFrame
print(res)
# _| id name_0 name_1 active age level
# 0| 101 Bill Smith True 34 1.5
# 1| 102 Bob Swanson False 36 2.3
System.out.println(df);
// _| id name
// 0| 101 Bill
// 1| 102 Bob

System.out.println(df2);
// _| name active
// 0| Smith true
// 1| Swanson false

System.out.println(df3);
// _| age level
// 0| 34 1.5
// 1| 36 2.3

DataFrame res = DataFrame.merge(df, df2, df3);
System.out.println(res);
// _| id name_0 name_1 active age level
// 0| 101 Bill Smith true 34 1.5
// 1| 102 Bob Swanson false 36 2.3
Like
The static function like() creates a new DataFrame instance with the same column structure as the provided DataFrame argument. The returned DataFrame is empty.
The following example shows how to copy the column structure of a DataFrame:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

df2 = DataFrame.like(df)  # returns a DataFrame
print(df2)
# __| id name active

print(df2.is_empty())
# True
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

DataFrame df2 = DataFrame.like(df);
System.out.println(df2);
// __| id name active

System.out.println(df2.isEmpty());
// true
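A typical use of like() is to collect selected or newly produced rows into a fresh DataFrame that shares the column structure of an existing one. A minimal sketch in Python, reusing the df from above and the add_row() method shown in the Flush section (the printed table is illustrative):

subset = DataFrame.like(df)  # empty, but with the columns id, name, active
subset.add_row([104, "Paul", True])
print(subset.is_empty())
# False
print(subset)
# _| id name active
# 0| 104 Paul True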
I/O Support
The DataFrame API provides standard functions for serialization support.
Serialization
You may serialize any DataFrame into an array of bytes. These byte arrays can be deserialized again which restores the original DataFrame object.
The following example shows DataFrame serialization:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

val = DataFrame.serialize(df)  # returns a Python bytearray
print(val.hex())
# 7b763a323b64 ... 4d61726b00a0

df = DataFrame.deserialize(val)  # returns a DataFrame
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True
import java.util.Arrays;

System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

byte[] val = DataFrame.serialize(df);
System.out.println(Arrays.toString(val));
// [123, 118, 58, 50, 59, 100, ... , 77, 97, 114, 107, 0, -96]

df = DataFrame.deserialize(val);
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
The byte array in the above example is not compressed. You can compress the byte array by passing an additional boolean flag to the serialize() function. The following example shows how to serialize a DataFrame to a compressed byte array:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

val = DataFrame.serialize(df, compress=True)
print(val.hex())
# 6466ab2eb332 ... 020080d80d6e

df = DataFrame.deserialize(val)
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True
import static com.raven.common.io.DataFrameSerializer.MODE_COMPRESSED;
import java.util.Arrays;

System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

byte[] val = DataFrame.serialize(df, MODE_COMPRESSED);
// or alternatively:
// byte[] val = DataFrame.serialize(df, true);
System.out.println(Arrays.toString(val));
// [100, 102, -85, 46, -77, 50, ... , 2, 0, -128, -40, 13, 110]

df = DataFrame.deserialize(val);
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
The byte array is automatically decompressed if applicable. Please note that both compression and decompression require additional runtime.
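The following minimal sketch in Python (reusing the df from above) illustrates this: deserialize() restores the original DataFrame regardless of whether the byte array was produced with or without compression:

plain = DataFrame.serialize(df)
packed = DataFrame.serialize(df, compress=True)
# Both byte arrays deserialize back to an equal DataFrame
print(DataFrame.deserialize(plain).equals(df))
# True
print(DataFrame.deserialize(packed).equals(df))
# True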
Read and Write Files
You may persist a DataFrame to a file. The file extension for DataFrame files is .df
The following example shows how to persist a DataFrame to a file and read that DataFrame from the file again:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

DataFrame.write("myFile.df", df)
df = DataFrame.read("myFile.df")
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

DataFrame.write("myFile.df", df);
df = DataFrame.read("myFile.df");
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
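Since all information held by a DataFrame is serializable, a write/read round trip restores an equal object. A minimal sketch in Python, reusing the df from above:

DataFrame.write("myFile.df", df)
restored = DataFrame.read("myFile.df")
# The persisted and re-read DataFrame equals the original
print(restored.equals(df))
# True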
Base64 Encoding
You may encode a DataFrame to a Base64-encoded string.
The following example shows how to encode a DataFrame to Base64:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

string = DataFrame.to_base64(df)  # returns a str
print(string)
# ZGarLr ... gNbg==

df = DataFrame.from_base64(string)
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

String string = DataFrame.toBase64(df);
System.out.println(string);
// ZGarLr ... gNbg==

df = DataFrame.fromBase64(string);
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
CSV Files
You can read a CSV file into a DataFrame and write a DataFrame to a CSV file. Although not strictly specified by the DataFrame specification, all available implementations provide support for handling CSV files.
The following example shows how to read a CSV file from the filesystem:
# Content of myFile.csv:
# id,name,active
# 101,Bill,True
# 102,Bob,False
# 103,Mark,True

df = DataFrame.read_csv("myFile.csv")  # returns a DataFrame
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True
// Content of myFile.csv:
// id,name,active
// 101,Bill,True
// 102,Bob,False
// 103,Mark,True

import com.raven.common.io.CSVReader;

DataFrame df = new CSVReader("myFile.csv").read();
System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true
All Columns in the returned DataFrame are StringColumns because CSV files do not carry any type information. You can explicitly define the column types in the returned DataFrame when reading a file.
The following example shows how to read a CSV file and specify the column types:
# Content of myFile.csv:
# id,name,active
# 101,Bill,True
# 102,Bob,False
# 103,Mark,True

df = DataFrame.read_csv("myFile.csv", types=("int", "string", "boolean"))
print(df.info())
# Type: Default
# Columns: 3
# Rows: 3
# _| column type code
# 0| id int 3
# 1| name string 5
# 2| active boolean 9
// Content of myFile.csv:
// id,name,active
// 101,Bill,True
// 102,Bob,False
// 103,Mark,True

import com.raven.common.io.CSVReader;

DataFrame df = new CSVReader("myFile.csv")
        .useColumnTypes(Integer.class, String.class, Boolean.class)
        .read();

System.out.println(df.info());
// Type: Default
// Columns: 3
// Rows: 3
// _| column type code
// 0| id int 3
// 1| name string 5
// 2| active boolean 9
The corresponding functions for reading CSV files have more parameters. See the source code documentation for details.
You can write a DataFrame to a CSV file in a similar way.
The following example shows how to write a DataFrame to a CSV file:
print(df)
# _| id name active
# 0| 101 Bill True
# 1| 102 Bob False
# 2| 103 Mark True

DataFrame.write_csv("myFile.csv", df)
import com.raven.common.io.CSVWriter;

System.out.println(df);
// _| id name active
// 0| 101 Bill true
// 1| 102 Bob false
// 2| 103 Mark true

new CSVWriter("myFile.csv").write(df);
The corresponding functions for writing CSV files have more parameters. See the source code documentation for details.