|
1 | | -dataframe-js |
2 | | -============ |
| 1 | +# dataframe-js |
| 2 | +**v0.2.0** |
3 | 3 |
|
| 4 | +## Presentation |
| 5 | + |
| 6 | +dataframe-js provides another way to work with data through the DataFrame, a powerful data structure already used in other ecosystems (Spark, Python, R, ...). |
| 7 | + |
| 8 | +A DataFrame is simply built on two concepts: |
| 9 | +- **Columns**, which provide ways to select and reorganize your data. |
| 10 | +- **Rows**, which provide ways to modify or filter your data. |
| 11 | + |
| 12 | +````javascript |
| 13 | +const df = new DataFrame(rawData, columns) |
| 14 | +df.show() |
| 15 | +// DataFrame example |
| 16 | +// | column1 | column2 | column3 | <--- Columns |
| 17 | +// ------------------------------------ |
| 18 | +// | 3 | 3 | undefined | <--- Row |
| 19 | +// | 6 | 4 | undefined | |
| 20 | +// | 8 | 5 | undefined | |
| 21 | +// | undefined | 6 | undefined | |
| 22 | +```` |
| 23 | + |
| 24 | +**DataFrame is immutable** (and lazy, for performance). Each modification returns a new DataFrame, which reduces side effects and keeps your data safer. |
| 25 | + |
| 26 | +**DataFrame is easy to use** thanks to a simple API (close to Spark or SQL) designed to make data manipulation faster and easier. |
| 27 | + |
| 28 | +**DataFrame is flexible** because you can create DataFrames from multiple data formats (array, object) and export them to several formats (array, object, csv, json...). |
| 29 | + |
| 30 | +**DataFrame is modular** because you can use additional modules (Stat and Matrix by default) or create your own. |
| 31 | + |
| 32 | +## Installation |
| 33 | + |
| 34 | +`npm install git+http://93.15.96.71:10080/odin/dataframe-js.git#feature/begin` |
| 35 | + |
| 36 | +## Manual |
| 37 | + |
| 38 | +dataframe-js contains a **main core (DataFrame and Row)** and **two default modules (Stat and Matrix)**. Refer to this manual to use them. You can also read the unit tests in `./tests/` or the documented code in `./src/`. |
| 39 | + |
| 40 | +### Core |
| 41 | + |
| 42 | +#### DataFrame and Row API documentation: [Core API](./doc/CORE_API.md) |
| 43 | + |
| 44 | +#### Usage: |
| 45 | + |
| 46 | +To use dataframe-js, simply import the library. Then you can use DataFrame, Row or other Core components. |
| 47 | + |
| 48 | +```javascript |
| 49 | +import { DataFrame, Row } from 'dataframe-js'; |
| 50 | +``` |
| 51 | + |
| 52 | +To create a DataFrame, pass your data and your column names. You can use different data structures, as shown below: |
| 53 | + |
| 54 | +```javascript |
| 55 | +const df = new DataFrame(myData, myColumns); |
| 56 | + |
| 57 | +const dfFromObjectOfArrays = new DataFrame({ |
| 58 | + column1: [3, 6, 8], //<------ A column |
| 59 | + column2: [3, 4, 5, 6], |
| 60 | +}, ['column1', 'column2']); |
| 61 | + |
| 62 | +const dfFromArrayOfArrays = new DataFrame([ |
| 63 | + [1, 6, 9, 10, 12], // <------- A row |
| 64 | + [1, 2], |
| 65 | + [6, 6, 9, 8, 9, 12], |
| 66 | +], ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']); |
| 67 | + |
| 68 | +const dfFromArrayOfObjects = new DataFrame([ |
| 69 | + {c1: 1, c2: 6}, // <--- A row |
| 70 | + {c4: 1, c3: 2} |
| 71 | +], ['c1', 'c2', 'c3', 'c4']); |
| 72 | +``` |
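The example inputs above have columns and rows of unequal length, and the `show()` output in the Presentation section suggests that missing cells become `undefined`. As a rough illustration of that padding, here is a minimal sketch in plain JavaScript. The `toRows` helper is hypothetical and is not part of dataframe-js; the library's actual behavior may differ.

```javascript
// Hypothetical helper, NOT part of dataframe-js: turns an object of
// column arrays into rows, leaving missing cells as undefined.
function toRows(columnsData, columnNames) {
  // The row count is the length of the longest column.
  const rowCount = Math.max(
    0,
    ...columnNames.map((name) => (columnsData[name] || []).length)
  );
  // Build each row by reading index i from every requested column.
  return Array.from({ length: rowCount }, (_, i) =>
    columnNames.map((name) => (columnsData[name] || [])[i])
  );
}

const rows = toRows(
  { column1: [3, 6, 8], column2: [3, 4, 5, 6] },
  ['column1', 'column2', 'column3']
);
console.log(rows);
// [ [3, 3, undefined], [6, 4, undefined], [8, 5, undefined], [undefined, 6, undefined] ]
```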
| 73 | + |
| 74 | +If you don't pass column names, they will be inferred from your data, but **it's slower**: |
| 75 | + |
| 76 | +```javascript |
| 77 | +// here you don't pass column names |
| 78 | +const dfFromObjectOfArrays = new DataFrame({ |
| 79 | + column1: [3, 6, 8], //<------ A column |
| 80 | + column2: [3, 4, 5, 6], |
| 81 | +}); |
| 82 | + |
| 83 | +console.log(dfFromObjectOfArrays.listColumns()) |
| 84 | +// ['column1', 'column2'] |
| 85 | + |
| 86 | +const dfFromArrayOfArrays = new DataFrame([ |
| 87 | + [1, 6, 9, 10, 12], // <------- A row |
| 88 | + [1, 2], |
| 89 | + [6, 6, 9, 8, 9, 12], |
| 90 | +]); |
| 91 | + |
| 92 | +console.log(dfFromArrayOfArrays.listColumns()) |
| 93 | +// ['0', '1', '2', '3', '4', '5'] |
| 94 | + |
| 95 | + |
| 96 | +const dfFromArrayOfObjects = new DataFrame([ |
| 97 | + {c1: 1, c2: 6}, // <--- A row |
| 98 | + {c4: 1, c3: 2} |
| 99 | +]); |
| 100 | + |
| 101 | +console.log(dfFromArrayOfObjects.listColumns()) |
| 102 | +// ['c1', 'c2', 'c3', 'c4'] |
| 103 | +``` |
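To see why inference costs extra time, consider what it has to do: the data must be scanned before any column name is known. The sketch below reproduces the results shown above for two of the three input shapes. `inferColumnNames` is a hypothetical function written for illustration, not the library's implementation, and it deliberately leaves out the array-of-objects case.

```javascript
// Hypothetical sketch, NOT the dataframe-js implementation.
function inferColumnNames(data) {
  if (Array.isArray(data)) {
    // Array of arrays: every row must be visited to find the widest one,
    // then positional names '0', '1', ... are generated.
    const width = Math.max(0, ...data.map((row) => row.length));
    return Array.from({ length: width }, (_, i) => String(i));
  }
  // Object of arrays: the keys are the column names.
  return Object.keys(data);
}

console.log(inferColumnNames([[1, 6, 9, 10, 12], [1, 2], [6, 6, 9, 8, 9, 12]]));
// ['0', '1', '2', '3', '4', '5']
console.log(inferColumnNames({ column1: [3, 6, 8], column2: [3, 4, 5, 6] }));
// ['column1', 'column2']
```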
| 104 | + |
| 105 | +You can also do the reverse and export your DataFrame to another format with: |
| 106 | +* [.toDict()](./doc/CORE_API.md#DataFrame+toDict) ⇒ <code>Object</code> |
| 107 | +* [.toArray()](./doc/CORE_API.md#DataFrame+toArray) ⇒ <code>Array</code> |
| 108 | +* [.toText([sep], [header], [path])](./doc/CORE_API.md#DataFrame+toText) ⇒ <code>String</code> |
| 109 | +* [.toCSV([header], [path])](./doc/CORE_API.md#DataFrame+toCSV) ⇒ <code>String</code> |
| 110 | +* [.toJSON([path])](./doc/CORE_API.md#DataFrame+toJSON) ⇒ <code>String</code> |
| 111 | + |
| 112 | +or you can debug by using: |
| 113 | +* [.show([rows], [quiet])](./doc/CORE_API.md#DataFrame+show) ⇒ <code>String</code> |
| 114 | + |
| 115 | +When you perform operations on a DataFrame (or on a Row), it is never mutated. Each modification (even one that changes nothing) creates a new instance of DataFrame. It's a bit slower, but you avoid side effects. |
| 116 | + |
| 117 | +Examples: |
| 118 | +```javascript |
| 119 | +// When you change the DataFrame structure, the original DataFrame doesn't change. |
| 120 | +df.drop('column1'); // <--- Here you drop a column. |
| 121 | +console.log(df.listColumns()); |
| 122 | +// But nothing changes in df. |
| 123 | +// You didn't mutate it. You just created a new instance of DataFrame. |
| 124 | +// ['column1', 'column2', 'column3'] |
| 125 | + |
| 126 | +// Here you declare a new variable (const) to save the modified df. |
| 127 | +const df2 = df.drop('column1'); |
| 128 | +console.log(df2.listColumns()); |
| 129 | +// ['column2', 'column3'] |
| 130 | + |
| 131 | +console.log(Object.is(df2.dim(), df.dim())); |
| 132 | +// false, they don't have the same dimensions. df2 is a new instance, distinct from df. |
| 133 | +console.log( |
| 134 | + Object.is( |
| 135 | + df2.map(row => row), |
| 136 | + df2 |
| 137 | + ) |
| 138 | +); |
| 139 | +// false. A modification of df2 returns another instance of DataFrame, even if nothing changes. |
| 140 | + |
| 141 | +// if we create a new column |
| 142 | +df2.withColumn('anewcolumn', row => row.get('column2') + 8); |
| 143 | +console.log( |
| 144 | + df2.select('anewcolumn') |
| 145 | +); |
| 146 | +// NoSuchColumnError |
| 147 | +// df2 wasn't mutated |
| 148 | + |
| 149 | +``` |
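The pattern behind these examples (every operation returns a new object and the original stays untouched) can be sketched without the library. The `TinyFrame` class below is a hypothetical stand-in written for illustration only; it is not how dataframe-js is implemented and it supports just `listColumns` and `drop`.

```javascript
// Hypothetical stand-in class, NOT part of dataframe-js.
class TinyFrame {
  constructor(rows, columns) {
    // Copy and freeze the inputs so the instance cannot change after creation.
    this.columns = Object.freeze([...columns]);
    this.rows = Object.freeze(rows.map((row) => Object.freeze([...row])));
    Object.freeze(this);
  }

  listColumns() {
    return [...this.columns];
  }

  // Returns a NEW TinyFrame without the given column; `this` is left as-is.
  drop(columnName) {
    const index = this.columns.indexOf(columnName);
    if (index === -1) {
      // Nothing to drop, but still return a fresh instance.
      return new TinyFrame(this.rows, this.columns);
    }
    return new TinyFrame(
      this.rows.map((row) => row.filter((_, i) => i !== index)),
      this.columns.filter((_, i) => i !== index)
    );
  }
}

const tf = new TinyFrame([[3, 3], [6, 4]], ['column1', 'column2']);
const tf2 = tf.drop('column1');
console.log(tf.listColumns());  // ['column1', 'column2']
console.log(tf2.listColumns()); // ['column2']
console.log(Object.is(tf.drop('missing'), tf)); // false
```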
| 150 | + |
| 151 | +For more information about all DataFrame manipulations, see the API below. |
| 152 | + |
| 153 | +#### List of available methods and their examples: |
| 154 | + |
| 155 | +* [DataFrame](./doc/CORE_API.md#DataFrame) |
| 156 | + * [new DataFrame(data, columns, [...modules])](./doc/CORE_API.md#new_DataFrame_new) |
| 157 | + * [.toDict()](./doc/CORE_API.md#DataFrame+toDict) ⇒ <code>Object</code> |
| 158 | + * [.toArray()](./doc/CORE_API.md#DataFrame+toArray) ⇒ <code>Array</code> |
| 159 | + * [.toText([sep], [header], [path])](./doc/CORE_API.md#DataFrame+toText) ⇒ <code>String</code> |
| 160 | + * [.toCSV([header], [path])](./doc/CORE_API.md#DataFrame+toCSV) ⇒ <code>String</code> |
| 161 | + * [.toJSON([path])](./doc/CORE_API.md#DataFrame+toJSON) ⇒ <code>String</code> |
| 162 | + * [.push(...rows)](./doc/CORE_API.md#DataFrame+push) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 163 | + * [.dim()](./doc/CORE_API.md#DataFrame+dim) ⇒ <code>Array</code> |
| 164 | + * [.transpose()](./doc/CORE_API.md#DataFrame+transpose) ⇒ <code>DataFrame</code> |
| 165 | + * [.count()](./doc/CORE_API.md#DataFrame+count) ⇒ <code>Int</code> |
| 166 | + * [.countValue(valueToCount, [columnName])](./doc/CORE_API.md#DataFrame+countValue) ⇒ <code>Int</code> |
| 167 | + * [.show([rows], [quiet])](./doc/CORE_API.md#DataFrame+show) ⇒ <code>String</code> |
| 168 | + * [.replace(value, replacment, [...columnNames])](./doc/CORE_API.md#DataFrame+replace) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 169 | + * [.distinct(columnName)](./doc/CORE_API.md#DataFrame+distinct) ⇒ <code>Array</code> |
| 170 | + * [.unique(columnName)](./doc/CORE_API.md#DataFrame+unique) ⇒ <code>Array</code> |
| 171 | + * [.listColumns()](./doc/CORE_API.md#DataFrame+listColumns) ⇒ <code>Array</code> |
| 172 | + * [.select(...columnNames)](./doc/CORE_API.md#DataFrame+select) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 173 | + * [.withColumn(columnName, [func])](./doc/CORE_API.md#DataFrame+withColumn) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 174 | + * [.restructure(newColumnNames)](./doc/CORE_API.md#DataFrame+restructure) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 175 | + * [.rename(newColumnNames)](./doc/CORE_API.md#DataFrame+rename) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 176 | + * [.drop(columnName)](./doc/CORE_API.md#DataFrame+drop) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 177 | + * [.chain(...funcs)](./doc/CORE_API.md#DataFrame+chain) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 178 | + * [.filter(func)](./doc/CORE_API.md#DataFrame+filter) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 179 | + * [.where(func)](./doc/CORE_API.md#DataFrame+where) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 180 | + * [.find(condition)](./doc/CORE_API.md#DataFrame+find) ⇒ <code>[Row](./doc/CORE_API.md#Row)</code> |
| 181 | + * [.map(func)](./doc/CORE_API.md#DataFrame+map) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 182 | + * [.reduce(func, [init])](./doc/CORE_API.md#DataFrame+reduce) ⇒ |
| 183 | + * [.reduceRight(func, [init])](./doc/CORE_API.md#DataFrame+reduceRight) ⇒ |
| 184 | + * [.shuffle()](./doc/CORE_API.md#DataFrame+shuffle) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 185 | + * [.sample(percentage)](./doc/CORE_API.md#DataFrame+sample) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 186 | + * [.randomSplit(percentage)](./doc/CORE_API.md#DataFrame+randomSplit) ⇒ <code>Array</code> |
| 187 | + * [.groupBy(columnName)](./doc/CORE_API.md#DataFrame+groupBy) ⇒ <code>Array</code> |
| 188 | + * [.sortBy(columnName, [reverse])](./doc/CORE_API.md#DataFrame+sortBy) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 189 | + * [.union(dfToUnion)](./doc/CORE_API.md#DataFrame+union) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 190 | + * [.join(dfToJoin, on, [how])](./doc/CORE_API.md#DataFrame+join) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 191 | + * [.innerJoin(dfToJoin, on)](./doc/CORE_API.md#DataFrame+innerJoin) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 192 | + * [.fullJoin(dfToJoin, on)](./doc/CORE_API.md#DataFrame+fullJoin) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 193 | + * [.outerJoin(dfToJoin, on)](./doc/CORE_API.md#DataFrame+outerJoin) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 194 | + * [.leftJoin(dfToJoin, on)](./doc/CORE_API.md#DataFrame+leftJoin) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 195 | + * [.rightJoin(dfToJoin, on)](./doc/CORE_API.md#DataFrame+rightJoin) ⇒ <code>[DataFrame](./doc/CORE_API.md#DataFrame)</code> |
| 196 | + |
| 197 | + |
| 198 | +* [Row](./doc/CORE_API.md#Row) |
| 199 | + * [new Row(data, columns)](./doc/CORE_API.md#new_Row_new) |
| 200 | + * [.toDict()](./doc/CORE_API.md#Row+toDict) ⇒ <code>Object</code> |
| 201 | + * [.toArray()](./doc/CORE_API.md#Row+toArray) ⇒ <code>Array</code> |
| 202 | + * [.size()](./doc/CORE_API.md#Row+size) ⇒ <code>Int</code> |
| 203 | + * [.has(columnName)](./doc/CORE_API.md#Row+has) ⇒ <code>Boolean</code> |
| 204 | + * [.select(...columnNames)](./doc/CORE_API.md#Row+select) ⇒ <code>[Row](./doc/CORE_API.md#Row)</code> |
| 205 | + * [.get(columnToGet)](./doc/CORE_API.md#Row+get) ⇒ |
| 206 | + * [.set(columnToSet)](./doc/CORE_API.md#Row+set) ⇒ <code>[Row](./doc/CORE_API.md#Row)</code> |
| 207 | + * [.delete(columnToDel)](./doc/CORE_API.md#Row+delete) ⇒ <code>[Row](./doc/CORE_API.md#Row)</code> |
| 208 | + |
| 209 | + |
| 210 | + |
| 211 | +### Modules |
| 212 | + |
| 213 | +#### Stat and Matrix modules API documentation: [Modules API](./doc/MODULES_API.md) |
| 214 | + |
| 215 | +#### Usage: |
| 216 | + |
| 217 | +dataframe-js is designed to make it easy to create and add modules that extend the DataFrame tools. |
| 218 | + |
| 219 | +When you create an instance of DataFrame, you can also pass modules, which will be available by calling their names. |
| 220 | + |
| 221 | +```javascript |
| 222 | +// Here you add two modules on your DataFrame instance. |
| 223 | +const df = new DataFrame(obj, ['column1', 'column2', 'column3'], fakeModule, anotherModule) |
| 224 | +// You can call modules by their names |
| 225 | +df.fakemodule.test(4) |
| 226 | +``` |
| 227 | + |
| 228 | +Modules will also be available on each DataFrame created from your first instance, so you don't need to redeclare them each time you create a DataFrame. |
| 229 | + |
| 230 | +```javascript |
| 231 | +// You create a second DataFrame from the last one. |
| 232 | +const df2 = df.withColumn('column4', (row) => row.get('column2') * 2) |
| 233 | +// This second DataFrame will have access to the same modules. |
| 234 | +df2.fakemodule.test(8) |
| 235 | +``` |
| 236 | + |
| 237 | +If you want to create your own module, take a look at the Stat module (integrated by default) in `./src/modules/stat.js` as an example. |
| 238 | + |
| 239 | +A simple example of a module structure: |
| 240 | + |
| 241 | +```javascript |
| 242 | +class fakeModule { |
| 243 | + constructor(dataframe) { |
| 244 | + this.df = dataframe; |
| 245 | + this.name = 'fakemodule'; |
| 246 | + } |
| 247 | + |
| 248 | + test(x) { |
| 249 | + return this.df.withColumn('test', row => row.set('test', x * 2)); |
| 250 | + } |
| 251 | +} |
| 252 | +``` |
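The text above describes two behaviors: a module is reachable through its `name`, and modules follow along to every DataFrame derived from the first one. One possible way to get both is sketched below in plain JavaScript. `ModularFrame` and `CountModule` are hypothetical names invented for this illustration; the sketch assumes the module constructor receives the frame and sets `this.name`, as in the structure shown above, and it is not the library's actual code.

```javascript
// Hypothetical sketch, NOT the dataframe-js implementation.
class ModularFrame {
  constructor(columns, ...modules) {
    this.columns = columns;
    this.modules = modules; // keep the classes so derived frames can reuse them
    for (const Module of modules) {
      const instance = new Module(this);
      // Expose the module under the name it declares for itself.
      this[instance.name] = instance;
    }
  }

  // Returns a new frame and forwards the same module classes to it.
  withColumn(columnName) {
    return new ModularFrame([...this.columns, columnName], ...this.modules);
  }
}

// A toy module following the constructor(dataframe) + name convention.
class CountModule {
  constructor(frame) {
    this.frame = frame;
    this.name = 'countmodule';
  }

  columnCount() {
    return this.frame.columns.length;
  }
}

const base = new ModularFrame(['column1', 'column2'], CountModule);
const derived = base.withColumn('column3');
console.log(base.countmodule.columnCount());    // 2
console.log(derived.countmodule.columnCount()); // 3
```

Each derived frame gets its own module instance bound to itself, which is why the two calls report different column counts.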
| 253 | + |
| 254 | +#### List of available modules |
| 255 | + |
| 256 | +* [Matrix](./doc/MODULES_API.md#Matrix) |
| 257 | + * [new Matrix(dataframe)](./doc/MODULES_API.md#new_Matrix_new) |
| 258 | + * [.hasSameStruct(df)](./doc/MODULES_API.md#Matrix+hasSameStruct) ⇒ <code>Boolean</code> |
| 259 | + * [.hasSameTransposedStruct(df)](./doc/MODULES_API.md#Matrix+hasSameTransposedStruct) ⇒ <code>Boolean</code> |
| 260 | + * [.add(df)](./doc/MODULES_API.md#Matrix+add) ⇒ <code>DataFrame</code> |
| 261 | + * [.product(number)](./doc/MODULES_API.md#Matrix+product) ⇒ <code>DataFrame</code> |
| 262 | + * [.dot(df)](./doc/MODULES_API.md#Matrix+dot) ⇒ <code>DataFrame</code> |
| 263 | + |
| 264 | +* [Stat](./doc/MODULES_API.md#Stat) |
| 265 | + * [new Stat(dataframe)](./doc/MODULES_API.md#new_Stat_new) |
| 266 | + * [.max(columnName)](./doc/MODULES_API.md#Stat+max) ⇒ <code>Number</code> |
| 267 | + * [.min(columnName)](./doc/MODULES_API.md#Stat+min) ⇒ <code>Number</code> |
| 268 | + * [.mean(columnName)](./doc/MODULES_API.md#Stat+mean) ⇒ <code>Number</code> |
| 269 | + * [.var(columnName, [population])](./doc/MODULES_API.md#Stat+var) ⇒ <code>Number</code> |
| 270 | + * [.sd(columnName, [population])](./doc/MODULES_API.md#Stat+sd) ⇒ <code>Number</code> |
| 271 | + * [.stats(columnName)](./doc/MODULES_API.md#Stat+stats) ⇒ <code>Object</code> |
| 272 | + |
| 273 | +## Contribution |
| 274 | + |
| 275 | +[How to contribute?](./CONTRIBUTING.md) |