diff --git "a/Estat\303\255stica_Descritiva_Kauan_Santiago_(Projeto_1).ipynb" "b/Estat\303\255stica_Descritiva_Kauan_Santiago_(Projeto_1).ipynb" new file mode 100644 index 0000000..3109e20 --- /dev/null +++ "b/Estat\303\255stica_Descritiva_Kauan_Santiago_(Projeto_1).ipynb" @@ -0,0 +1,2706 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_KJSn9as96pO" + }, + "source": [ + "### Descriptive Statistics (Kauan Santiago Ferreira)\n", + " Data Engineer\n", + "\n", + "---\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wxow8PtQDf7R" + }, + "source": [ + "Statistics is a branch of Mathematics devoted to the analysis of data. It lies at the foundation of the scientific method, since data obtained in scientific experiments must be analyzed rigorously using statistical methods.\n", + "\n", + "Descriptive statistics aims to synthesize, that is, to summarize a series of values of the same nature, giving an overall view of how those values vary. 
Descriptive statistics organizes and describes data in three ways: through tables, charts, and\n", + "descriptive measures.\n", + "\n", + "As **descriptive measures** we can cite measures of central tendency, measures of dispersion, and measures of distribution.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PN2Pr4wt3ZCy" + }, + "source": [ + "## Patient Dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-kRhoDH99IKP" + }, + "source": [ + "Before looking at the Python functions for obtaining descriptive measures, we need to load and explore the dataset we will use for the calculations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TB0Ld77v9m_R" + }, + "source": [ + "The instructions below load a PATIENTS dataset containing identification (`Ident`), age (`Idade`), height (`Altura`), weight (`Peso`), pulse in beats per minute (`Pulsacao`), systolic and diastolic pressure (`Sistolica`, `Diastolica`), cholesterol level (`Colesterol`), and Body Mass Index (`IMC`)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LGv_wDpYDizP", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "1aa53174-ebeb-4a00-ba6e-d7c6e07731eb" + }, + "source": [ + "import pandas as pd\n", + "# https://drive.google.com/file/d/1foZsAIE3sSoFf53ENs9QCQ29ZqcMOxac/view?usp=sharing\n", + "codigo = \"1foZsAIE3sSoFf53ENs9QCQ29ZqcMOxac\"\n", + "file = \"https://drive.google.com/u/3/uc?id=\" + codigo + \"&export=download\"\n", + "# Patient dataset\n", + "# Note the SEPARATOR specification (semicolon)\n", + "# Note the DECIMAL COMMA specification\n", + "pacientes = pd.read_csv(file, sep=\";\", decimal=\",\")\n", + "# df = pd.read_csv(file)\n", + "pacientes.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Ident Idade Altura Peso Pulsacao Sistolica Diastolica Colesterol \\\n", + "0 1 58 1.80 76.1 68 125 78 522 \n", + "1 2 22 1.68 64.9 64 107 54 127 \n", + "2 3 32 1.82 80.7 88 126 81 740 \n", + "3 4 31 1.74 79.1 72 110 68 49 \n", + "4 5 28 1.72 68.7 64 110 66 230 \n", + "\n", + " IMC \n", + "0 23.5 \n", + "1 23.0 \n", + "2 24.3 \n", + "3 26.0 \n", + "4 23.3 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdentIdadeAlturaPesoPulsacaoSistolicaDiastolicaColesterolIMC
01581.8076.1681257852223.5
12221.6864.9641075412723.0
23321.8280.7881268174024.3
34311.7479.172110684926.0
45281.7268.7641106623023.3
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "pacientes", + "summary": "{\n \"name\": \"pacientes\",\n \"rows\": 21,\n \"fields\": [\n {\n \"column\": \"Ident\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1,\n \"max\": 21,\n \"num_unique_values\": 21,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Idade\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16,\n \"min\": 17,\n \"max\": 73,\n \"num_unique_values\": 15,\n \"samples\": [\n 54,\n 73,\n 58\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Altura\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.08453514343863432,\n \"min\": 1.56,\n \"max\": 1.94,\n \"num_unique_values\": 16,\n \"samples\": [\n 1.8,\n 1.68,\n 1.76\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Peso\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12.375115199079005,\n \"min\": 60.8,\n \"max\": 106.7,\n \"num_unique_values\": 21,\n \"samples\": [\n 76.1,\n 99.3,\n 106.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Pulsacao\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 56,\n \"max\": 96,\n \"num_unique_values\": 9,\n \"samples\": [\n 56,\n 64,\n 76\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Sistolica\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 107,\n \"max\": 153,\n \"num_unique_values\": 13,\n \"samples\": [\n 121,\n 119,\n 125\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Diastolica\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 10,\n \"min\": 44,\n \"max\": 87,\n \"num_unique_values\": 17,\n \"samples\": [\n 78,\n 54,\n 83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n 
\"column\": \"Colesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 250,\n \"min\": 49,\n \"max\": 972,\n \"num_unique_values\": 20,\n \"samples\": [\n 522,\n 138,\n 972\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"IMC\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3.6657747833036276,\n \"min\": 19.7,\n \"max\": 32.8,\n \"num_unique_values\": 19,\n \"samples\": [\n 23.5,\n 21.3,\n 24.9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 1 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WLnGnhW2G55q", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "be80df5f-6d74-48ef-e56b-cb45dbee5242" + }, + "source": [ + "# Note the Dtype column, which shows the data types (int64 - integer; float64 - real)\n", + "pacientes.info()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "<class 'pandas.core.frame.DataFrame'>\n", + "RangeIndex: 21 entries, 0 to 20\n", + "Data columns (total 9 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Ident 21 non-null int64 \n", + " 1 Idade 21 non-null int64 \n", + " 2 Altura 21 non-null float64\n", + " 3 Peso 21 non-null float64\n", + " 4 Pulsacao 21 non-null int64 \n", + " 5 Sistolica 21 non-null int64 \n", + " 6 Diastolica 21 non-null int64 \n", + " 7 Colesterol 21 non-null int64 \n", + " 8 IMC 21 non-null float64\n", + "dtypes: float64(3), int64(6)\n", + "memory usage: 1.6 KB\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qrQ5YhV8gOmX" + }, + "source": [ + "## Measures of Central Tendency\n", + "From the patient data, we may want to know how healthy this group is. If the list is very large, analyzing patients one by one becomes impractical. A first step is to look at information that summarizes the data and then check whether conclusions can be drawn. 
In general, analyzing the central tendency gives a good idea of this.\n", + "\n", + "A **central tendency** (or, more commonly, a measure of central tendency) is a central or typical value for a given attribute. The most common measures of central tendency are:\n", + "\n", + "- **Arithmetic mean** (or simply, mean) - the sum of all measurements divided by the number of observations in the dataset.\n", + "- **Median** - obtained from the sorted data by finding the value (which may or may not belong to the sample) that splits it in half, that is, 50% of the sample elements are less than or equal to the median and the other 50% are greater than or equal to the median.\n", + "- **Mode** - the value that appears most frequently in the dataset.\n", + "\n", + "Pandas has built-in functions that compute these measures of central tendency:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZlMdvKC-k5cR" + }, + "source": [ + "### Mean\n", + "\n", + "For example, to compute the mean of the patient measurements, we can use the `mean` function, which computes the mean of each column of the dataset." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jNoSbTpReeib", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 366 + }, + "outputId": "d0667e99-1b88-4d15-d9d9-0a1f56c317ff" + }, + "source": [ + "## arithmetic mean\n", + "pacientes.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ident 11.000000\n", + "Idade 36.476190\n", + "Altura 1.731905\n", + "Peso 77.904762\n", + "Pulsacao 72.761905\n", + "Sistolica 119.952381\n", + "Diastolica 73.428571\n", + "Colesterol 338.285714\n", + "IMC 25.976190\n", + "dtype: float64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
Ident11.000000
Idade36.476190
Altura1.731905
Peso77.904762
Pulsacao72.761905
Sistolica119.952381
Diastolica73.428571
Colesterol338.285714
IMC25.976190
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gx0QuEN6lFaF" + }, + "source": [ + "The instruction above, applied to the DataFrame, shows the mean of every attribute. However, the mean of some attributes is meaningless, for example \"Ident\" (the patient code).\n", + "\n", + "Moreover, we usually want the mean of one specific attribute under investigation.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tnWUGJd5b5NB" + }, + "source": [ + "#### Mean of a specific attribute\n", + "\n", + "To compute the mean of a specific attribute, such as systolic pressure, we must specify the attribute name.\n", + "\n", + "One option is the notation below, with the quoted attribute name inside square brackets: **`[\"Sistolica\"]`**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nSBXLZRNb-9l", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a19b677c-8f2c-47b8-ecee-361598f32360" + }, + "source": [ + "# mean of the systolic pressure\n", + "pacientes[\"Sistolica\"].mean()\n" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "np.float64(119.95238095238095)" + ] + }, + "metadata": {}, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aq2mJxvRBcwp" + }, + "source": [ + "Another, slightly simpler notation uses the attribute name directly (without the brackets, but with an extra dot): **`.Sistolica`**. It only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Geq-izIoTj2Y", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "b0c783aa-c406-495e-9242-bf3fd05dbf37" + }, + "source": [ + "# Another way to obtain the mean of a given attribute\n", + "pacientes.Sistolica.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + 
"np.float64(119.95238095238095)" + ] + }, + "metadata": {}, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hyq03z6TuVrO" + }, + "source": [ + "Python lets us limit the number of decimal places by using `print` with formatting. See the example below:\n", + "* \"%.2f\" (in quotes) specifies 2 decimal places\n", + "* Note the % between the format (\"%.2f\") and the value to be displayed in that format (pacientes.Sistolica.mean())." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jG4s2dTruHdz", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "9c270c43-2ded-44f1-d2b3-870191d12f4a" + }, + "source": [ + "# printing with 2 decimal places: .2\n", + "print(\"%.2f\" % pacientes.Sistolica.mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "119.95\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6HyGJoFv1NY" + }, + "source": [ + "Another option is to assign the result to a variable (for example, `media`) before displaying it on screen (with `print`)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lZ6SoV6lwCzD", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "9ef98768-d759-426e-be99-5c640bde8418" + }, + "source": [ + "# assigning the mean to a variable\n", + "media = pacientes.Sistolica.mean()\n", + "# printing with 2 decimal places: .2\n", + "print(\"%.2f\" % media)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "119.95\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QZ0fcVkqYiyS" + }, + "source": [ + "To help interpret this mean, we can check the extreme values of `Sistolica` (largest and smallest)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B0wLr6AZYNuI", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "8e1cdba1-d199-4a4b-9740-1fb0e080f95c" + }, + "source": [ + "# Largest systolic pressure value\n", + "pacientes.Sistolica.max()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "153" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XUADVExgYvvL", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "9a12d43b-eef8-455c-ef8c-8ee9f45b32de" + }, + "source": [ + "# Smallest systolic pressure value\n", + "pacientes.Sistolica.min()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "107" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pYbYdl3qY1P7" + }, + "source": [ + "There is a fair amount of spread between the mean `mean()` (119.95), the largest `max()` systolic pressure (153), and the smallest `min()` systolic pressure (107). We will understand this variation better later on, with the measures of dispersion. Before that, let us look at the other measures of central tendency." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ouEm0eQ5cCQ-" + }, + "source": [ + "### Median\n", + "\n", + "The median can be computed with the `median` function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_3AW3ND2fnQZ", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 366 + }, + "outputId": "612c5085-2cb4-44ff-878a-e28493ed4b59" + }, + "source": [ + "# median\n", + "pacientes.median()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ident 11.00\n", + "Idade 32.00\n", + "Altura 1.73\n", + "Peso 76.10\n", + "Pulsacao 72.00\n", + "Sistolica 119.00\n", + "Diastolica 76.00\n", + "Colesterol 265.00\n", + "IMC 26.00\n", + "dtype: float64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
Ident11.00
Idade32.00
Altura1.73
Peso76.10
Pulsacao72.00
Sistolica119.00
Diastolica76.00
Colesterol265.00
IMC26.00
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hzdo1N7MXzgs" + }, + "source": [ + "#### Median of a specific attribute\n", + "Computing the median of the `Sistolica` attribute only" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7FaXpfkfX5y8", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "24a71938-187a-4143-c9b1-364741bd3750" + }, + "source": [ + "pacientes.Sistolica.median()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "119.0" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UOd2sDYVowjn" + }, + "source": [ + "In this case, the mean and the median do not differ much (119.95 and 119, respectively), and either could be used to represent the central tendency of the `Sistolica` attribute. Note that even though `Sistolica` has extreme values (minimum = 107 and maximum = 153) reasonably far from the mean (119.95), the mean and the median are still close.\n", + "\n", + "When extreme values are present, the median tends to be **less sensitive** to them than the mean.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HLS4F9mGcHZ1" + }, + "source": [ + "### Mode\n", + "\n", + "The mode is the value that occurs most often in the dataset. The mode may **not** exist, or it may not be unique." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DV6ITz4FELAY" + }, + "source": [ + "One way to obtain this value is to compute the frequency of each value by grouping the data.\n", + "\n", + "For example, to count the frequency of each **systolic pressure** value we could use the following instruction:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tDKHaozCE4qq", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 523 + }, + "outputId": "4338b28c-f204-43e5-9f2c-0cc2023811ca" + }, + "source": [ + "# Grouping by Sistolica and counting on the Sistolica attribute\n", + "pacientes.groupby(\"Sistolica\")[\"Sistolica\"].count()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Sistolica\n", + "107 2\n", + "109 1\n", + "110 3\n", + "112 2\n", + "113 2\n", + "119 1\n", + "121 2\n", + "125 2\n", + "126 2\n", + "131 1\n", + "132 1\n", + "137 1\n", + "153 1\n", + "Name: Sistolica, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Sistolica
Sistolica
1072
1091
1103
1122
1132
1191
1212
1252
1262
1311
1321
1371
1531
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oLffXN58FFST" + }, + "source": [ + "Note above that **Sistolica = 110** is the value that occurs most often (3 times). So, in this case, the mode is 110." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xhQPa3zcFeqD" + }, + "source": [ + "Pandas, however, provides the `mode` function to compute the mode:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NfJBPAVojpoB", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 708 + }, + "outputId": "e068f71c-078c-4feb-ed68-557bc45ccb2f" + }, + "source": [ + "# mode\n", + "pacientes.mode()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Ident Idade Altura Peso Pulsacao Sistolica Diastolica Colesterol \\\n", + "0 1 20.0 1.73 60.8 64.0 110.0 81.0 265.0 \n", + "1 2 NaN NaN 61.8 72.0 NaN NaN NaN \n", + "2 3 NaN NaN 62.6 NaN NaN NaN NaN \n", + "3 4 NaN NaN 64.9 NaN NaN NaN NaN \n", + "4 5 NaN NaN 68.1 NaN NaN NaN NaN \n", + "5 6 NaN NaN 68.7 NaN NaN NaN NaN \n", + "6 7 NaN NaN 70.3 NaN NaN NaN NaN \n", + "7 8 NaN NaN 73.9 NaN NaN NaN NaN \n", + "8 9 NaN NaN 74.7 NaN NaN NaN NaN \n", + "9 10 NaN NaN 75.1 NaN NaN NaN NaN \n", + "10 11 NaN NaN 76.1 NaN NaN NaN NaN \n", + "11 12 NaN NaN 78.8 NaN NaN NaN NaN \n", + "12 13 NaN NaN 79.1 NaN NaN NaN NaN \n", + "13 14 NaN NaN 79.5 NaN NaN NaN NaN \n", + "14 15 NaN NaN 80.7 NaN NaN NaN NaN \n", + "15 16 NaN NaN 84.0 NaN NaN NaN NaN \n", + "16 17 NaN NaN 86.0 NaN NaN NaN NaN \n", + "17 18 NaN NaN 90.7 NaN NaN NaN NaN \n", + "18 19 NaN NaN 94.2 NaN NaN NaN NaN \n", + "19 20 NaN NaN 99.3 NaN NaN NaN NaN \n", + "20 21 NaN NaN 106.7 NaN NaN NaN NaN \n", + "\n", + " IMC \n", + "0 24.3 \n", + "1 32.8 \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "5 NaN \n", + "6 NaN \n", + "7 NaN \n", + "8 NaN \n", + "9 NaN \n", + "10 NaN \n", + "11 NaN \n", + "12 NaN \n", + "13 NaN \n", + 
"14 NaN \n", + "15 NaN \n", + "16 NaN \n", + "17 NaN \n", + "18 NaN \n", + "19 NaN \n", + "20 NaN " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdentIdadeAlturaPesoPulsacaoSistolicaDiastolicaColesterolIMC
0120.01.7360.864.0110.081.0265.024.3
12NaNNaN61.872.0NaNNaNNaN32.8
23NaNNaN62.6NaNNaNNaNNaNNaN
34NaNNaN64.9NaNNaNNaNNaNNaN
45NaNNaN68.1NaNNaNNaNNaNNaN
56NaNNaN68.7NaNNaNNaNNaNNaN
67NaNNaN70.3NaNNaNNaNNaNNaN
78NaNNaN73.9NaNNaNNaNNaNNaN
89NaNNaN74.7NaNNaNNaNNaNNaN
910NaNNaN75.1NaNNaNNaNNaNNaN
1011NaNNaN76.1NaNNaNNaNNaNNaN
1112NaNNaN78.8NaNNaNNaNNaNNaN
1213NaNNaN79.1NaNNaNNaNNaNNaN
1314NaNNaN79.5NaNNaNNaNNaNNaN
1415NaNNaN80.7NaNNaNNaNNaNNaN
1516NaNNaN84.0NaNNaNNaNNaNNaN
1617NaNNaN86.0NaNNaNNaNNaNNaN
1718NaNNaN90.7NaNNaNNaNNaNNaN
1819NaNNaN94.2NaNNaNNaNNaNNaN
1920NaNNaN99.3NaNNaNNaNNaNNaN
2021NaNNaN106.7NaNNaNNaNNaNNaN
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"pacientes\",\n \"rows\": 21,\n \"fields\": [\n {\n \"column\": \"Ident\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1,\n \"max\": 21,\n \"num_unique_values\": 21,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Idade\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 20.0,\n \"max\": 20.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 20.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Altura\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 1.73,\n \"max\": 1.73,\n \"num_unique_values\": 1,\n \"samples\": [\n 1.73\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Peso\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12.375115199079005,\n \"min\": 60.8,\n \"max\": 106.7,\n \"num_unique_values\": 21,\n \"samples\": [\n 60.8\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Pulsacao\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.656854249492381,\n \"min\": 64.0,\n \"max\": 72.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 72.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Sistolica\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 110.0,\n \"max\": 110.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 110.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Diastolica\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 81.0,\n \"max\": 81.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 81.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Colesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 
null,\n \"min\": 265.0,\n \"max\": 265.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 265.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"IMC\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.010407640085652,\n \"min\": 24.3,\n \"max\": 32.8,\n \"num_unique_values\": 2,\n \"samples\": [\n 32.8\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "riaPePnaFnnf" + }, + "source": [ + "The instruction above showed the mode of every attribute. When the mode of an attribute is unique, it appears in row `0`. Note that for some attributes, such as `Peso`, mode values appear in more than one row. For `Peso`, all 21 rows are filled because every weight is different, so each value occurs exactly once. **In this case we say the mode does not exist.**\n", + "\n", + "The instruction below shows the frequency of each value of the `Peso` attribute. Note that the frequency equals 1 for every weight.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Abw7wplOHXiK", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 774 + }, + "outputId": "780c9b58-55c1-4848-9f81-458735306f3c" + }, + "source": [ + "# Grouping by Peso and counting on the Peso attribute\n", + "pacientes.groupby(\"Peso\")[\"Peso\"].count()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Peso\n", + "60.8 1\n", + "61.8 1\n", + "62.6 1\n", + "64.9 1\n", + "68.1 1\n", + "68.7 1\n", + "70.3 1\n", + "73.9 1\n", + "74.7 1\n", + "75.1 1\n", + "76.1 1\n", + "78.8 1\n", + "79.1 1\n", + "79.5 1\n", + "80.7 1\n", + "84.0 1\n", + "86.0 1\n", + "90.7 1\n", + "94.2 1\n", + "99.3 1\n", + "106.7 1\n", + "Name: Peso, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Peso
Peso
60.81
61.81
62.61
64.91
68.11
68.71
70.31
73.91
74.71
75.11
76.11
78.81
79.11
79.51
80.71
84.01
86.01
90.71
94.21
99.31
106.71
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IHRsSDKSGxj1" + }, + "source": [ + "One way to list only the **\"first mode\"** is shown below, using the `loc[0]` notation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3432ZFgzcug6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 366 + }, + "outputId": "a0b2e89e-ba3b-4b2b-fdd2-34b7ed52aa07" + }, + "source": [ + "# loc[0] - shows row 0 of the mode result\n", + "pacientes.mode().loc[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ident 1.00\n", + "Idade 20.00\n", + "Altura 1.73\n", + "Peso 60.80\n", + "Pulsacao 64.00\n", + "Sistolica 110.00\n", + "Diastolica 81.00\n", + "Colesterol 265.00\n", + "IMC 24.30\n", + "Name: 0, dtype: float64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
Ident1.00
Idade20.00
Altura1.73
Peso60.80
Pulsacao64.00
Sistolica110.00
Diastolica81.00
Colesterol265.00
IMC24.30
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K_-kDnsGceqc" + }, + "source": [ + "#### Moda especificando o atributo\n", + "\n", + "Para listar a moda de um único atributo utiliza-se o nome do atributo, conforme abaixo, seguido de `mode()`. Por exemplo, a moda de `Sistolica`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t8wGf2byclTl", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 115 + }, + "outputId": "d1030edf-69c7-407b-b7cd-23ee7eb53d49" + }, + "source": [ + "pacientes.Sistolica.mode()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 110\n", + "Name: Sistolica, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Sistolica
0110
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yeSvu68wLRiu" + }, + "source": [ + "Caso haja mais de uma moda, todos os valores serão exibidos. Veja o caso de Pulsacao." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ssHXnKA7LXvT", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 147 + }, + "outputId": "d74b27df-f522-4ba8-888c-7d941f5b5877" + }, + "source": [ + "pacientes.Pulsacao.mode()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 64\n", + "1 72\n", + "Name: Pulsacao, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Pulsacao
064
172
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XMlDRgpmLhpH" + }, + "source": [ + "Para conferir, vamos listar a frequência de `Pulsacao`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NXy_oZrdLnZ3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 398 + }, + "outputId": "b367afcd-b63d-4065-9cf6-0fc074b5f986" + }, + "source": [ + "# Agrupando por Pulsacao e contando pelo atributo Pulsacao\n", + "pacientes.groupby(\"Pulsacao\")[\"Pulsacao\"].count()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Pulsacao\n", + "56 1\n", + "60 3\n", + "64 4\n", + "68 1\n", + "72 4\n", + "76 2\n", + "84 2\n", + "88 3\n", + "96 1\n", + "Name: Pulsacao, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Pulsacao
Pulsacao
561
603
644
681
724
762
842
883
961
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HlgdKJUILwn4" + }, + "source": [ + "Note que Pulsacao = 64 e Pulsacao = 72 possuem as maiores frequências. Neste caso, dizemos que a frequência é bimodal (dois valores), ou seja, a moda de Pulsacao é igual a 64 e 72." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cWrHkJw5lW2w" + }, + "source": [ + "## Exibindo Média, Mediana e Moda com print\n", + "\n", + "Veja a seguir que podemos usar o print para exibir uma mensagem com um RÓTULO, indicando o que significa o valor exibido." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vUd9JP5mlq0a", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "b3e6d4dc-2d9f-4a50-8b31-30c160b041fa" + }, + "source": [ + "print(\"Média :\", pacientes.Sistolica.mean())\n", + "print(\"Mediana: \", pacientes.Sistolica.median())\n", + "print(\"Moda: \", pacientes.Sistolica.mode())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Média : 119.95238095238095\n", + "Mediana: 119.0\n", + "Moda: 0 110\n", + "Name: Sistolica, dtype: int64\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ZzytDiyMc8y" + }, + "source": [ + "## Exercícios" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e2KbZ1J4M2x4" + }, + "source": [ + "### Exercícios 01 - Média" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s73_g68JPOq2" + }, + "source": [ + "# Exibir a média do atributo Idade\n", + "# Digite sua resposta aqui" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4XZyXwQ2PGXV" + }, + "source": [ + "### Exercícios 02 - Mediana" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "z0yvhoyCPITw" + }, + "source": [ + "# Exibir a mediana do atributo Idade\n", + "# Digite sua resposta aqui" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8_6h4MHUPI94" + }, + "source": [ + "### Exercício 03 - Moda" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "thJ9uZ5HPrVg" + }, + "source": [ + "# Exibir a moda do atributo Idade\n", + "# Digite sua resposta aqui" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XP1dSI0ZmXq7" + }, + "source": [ + "### Exercício 04 - Média, Mediana e Moda" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vn2rpDTBl6q2" + }, + "source": [ + "# Exibir as três medidas de tendência central de Idade (média, mediana e moda) com rótulos que identifiquem cada medida.\n", + "# Use um print para cada medida\n", + "# Digite sua resposta aqui" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "29zWYB_0Wyr-" + }, + "source": [ + "## Extra" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dVflM1y8W3b6" + }, + "source": [ + "### Exibindo a maior moda (quando há mais de uma)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "akC3ON1BXqYH" + }, + "source": [ + "Note que existem dois valores de moda para `Pulsacao`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iW0iwGlLXIKT", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 147 + }, + "outputId": "ea97ec7e-91e0-4d24-a781-f07c3f2f6b8b" + }, + "source": [ + "moda = pacientes.Pulsacao.mode()\n", + "moda" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 64\n", + "1 72\n", + "Name: Pulsacao, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Pulsacao
064
172
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OetFtMY62jCC", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "135fa84a-c773-46c9-c6f4-b5bef18f6d02" + }, + "source": [ + "pacientes.Pulsacao.mode().max()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "72" + ] + }, + "metadata": {}, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I-K5NY1OXx2b" + }, + "source": [ + "A função `max()` retorna o maior valor de uma variável, conforme instrução a seguir." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dbpAaAMsXV7h", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "1ad7e5d9-5b88-4ad6-9b97-5ad0bcbb8813" + }, + "source": [ + "moda.max()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "72" + ] + }, + "metadata": {}, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q8Nqe2n-d3hb" + }, + "source": [ + "### Entrada de Dados [1]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gKMIx2lnd9NG" + }, + "source": [ + "Podemos perguntar ao usuário qual o nome da coluna devemos usar para, por exemplo, calcular a média (ou qualquer outro valor).\n", + "\n", + "Isto se chama \"solicitar uma entrada de dados\". Chamamos de **`Entrada de Dados`** pois \"entra\" (via teclado) uma informação na \"memória do computador\". Vamos explorar isso melhor mais adiante." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TCkjpHs9efHi" + }, + "source": [ + "Em python, a função `input()` é usada para a realizar a entrada de dados. Vide exemplo a seguir." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SgDJICZherhL", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "eb74d964-0aa5-410b-d5a2-4b63e4efc5dd" + }, + "source": [ + "coluna = input()" + ], + "execution_count": null, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Pulsacao\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdVAWW4pe-d7" + }, + "source": [ + "Note que a instrução a seguir calcula a média dos valores usando a **\"`coluna`\"** (atributo) informada pelo usuário.\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2hgLvDqZoSxU", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "599deb49-03e0-4096-ec42-8d1945229433" + }, + "source": [ + "media = pacientes[[coluna]].mean()\n", + "print(\"Média da \", coluna, \" = \", media)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Média da Pulsacao = Pulsacao 72.761905\n", + "dtype: float64\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eDCj_MZ102w9" + }, + "source": [ + "[ATENÇÃO]: No \"input\" é possível enviar uma \"mensagem\" para informar o que deve ser digitado." 
 + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EhOPO9uY0oaj", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "2a51d065-bc8b-4e0b-db3e-ecbbfc24989f" + }, + "source": [ + "coluna = input(\"Digite o título da coluna para exibir a média:\")\n", + "media = pacientes[[coluna]].mean()\n", + "print(\"Média da \", coluna, \" = \", media)\n" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Digite o título da coluna para exibir a média:Pulsacao\n", + "Média da Pulsacao = Pulsacao 72.761905\n", + "dtype: float64\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zIXX0xtFBJcd" + }, + "source": [ + "Execute novamente a célula acima, digitando outro valor para \"coluna\", e veja o resultado. As instruções funcionam para qualquer valor de \"coluna\" informado (desde que o nome da coluna conste na tabela)." + ] + } + ] +} \ No newline at end of file diff --git "a/RandomForest_(Previs\303\243o_de_Dataset)_Kauan_Santiago.ipynb" "b/RandomForest_(Previs\303\243o_de_Dataset)_Kauan_Santiago.ipynb" new file mode 100644 index 0000000..a710f28 --- /dev/null +++ "b/RandomForest_(Previs\303\243o_de_Dataset)_Kauan_Santiago.ipynb" @@ -0,0 +1,629 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Kauan Santiago\n", + "Engenheiro de Dados" + ], + "metadata": { + "id": "WtWhNJti5A59" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Classificação de Utilidade de Kit Médico — Airline Dataset\n", + "\n", + "## Contexto\n", + "Este notebook tem como objetivo desenvolver e
avaliar um modelo de **classificação supervisionada** para prever se um **kit médico adquirido por passageiros durante o check-in** é **útil (0)** ou **não útil (1)**.\n", + "\n", + "O projeto foi desenvolvido por Luan Alysson de Souza, utilizando bibliotecas do ecossistema `scikit-learn` e ferramentas clássicas de análise de dados.\n", + "\n", + "---\n", + "\n", + "## Estrutura do Dataset\n", + "\n", + "O conjunto de dados é composto pelos seguintes arquivos:\n", + "\n", + "- **train.csv** → 6736 linhas × 10 colunas \n", + "- **test.csv** → 2164 linhas × 9 colunas \n", + "\n", + "### Colunas principais:\n", + "| Coluna | Descrição |\n", + "|----------------|------------|\n", + "| `ID` | Identificador único do registro |\n", + "| `Distributor` | Código do distribuidor |\n", + "| `Product` | Código do produto |\n", + "| `Duration` | Tempo até o destino |\n", + "| `Destination` | Código do destino |\n", + "| `Sales` | Valor da venda |\n", + "| `Commission` | Comissão do distribuidor |\n", + "| `Gender` | Gênero do passageiro |\n", + "| `Age` | Idade do passageiro |\n", + "| `Target` | 0: Útil / 1: Não útil |\n", + "\n", + "---\n", + "\n" + ], + "metadata": { + "id": "CnadV1d5DUMJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Setup" + ], + "metadata": { + "id": "1bjAi_IkHZ75" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NVMhX55ujDQ-", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "36f0e644-1a80-4d3f-ab92-bc1cba2e1ab9" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape do treino: (6736, 10)\n", + "Shape do teste: (2164, 9)\n", + "\n", + "Primeiras linhas do treino:\n", + " ID Distributor Product Duration Destination \\\n", + "0 fffe3800370038003900 7 1 22 122 \n", + "1 fffe34003200370037003500 7 1 26 52 \n", + "2 fffe32003100320030003200 7 10 15 83 \n", + "3 fffe34003400310037003000 8 25 24 55 \n", + "4 fffe32003400390038003000 6 16 12 122 \n", 
+ "\n", + " Sales Commission Gender Age Target \n", + "0 31.0 0.00 NaN 20 0 \n", + "1 22.0 0.00 NaN 36 0 \n", + "2 63.0 0.00 NaN 34 0 \n", + "3 62.0 24.80 0.0 118 0 \n", + "4 19.8 11.88 NaN 26 0 \n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.model_selection import train_test_split, cross_val_score\n", + "from sklearn.preprocessing import LabelEncoder\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.metrics import f1_score, classification_report, confusion_matrix\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "# Carregar os dados\n", + "train_df = pd.read_csv('https://drive.google.com/uc?export=download&id=18qnzpCbngQKduw1pQ0Fki_-3fWdUm1Si')\n", + "test_df = pd.read_csv('https://drive.google.com/uc?export=download&id=1bJf_qceELneHGHlnoENJD1PRqoDImIut')\n", + "\n", + "print(\"Shape do treino:\", train_df.shape)\n", + "print(\"Shape do teste:\", test_df.shape)\n", + "print(\"\\nPrimeiras linhas do treino:\")\n", + "print(train_df.head())" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Descrição" + ], + "metadata": { + "id": "w-cvbmkXHeM5" + } + }, + { + "cell_type": "code", + "source": [ + "# Ver informações dos dados\n", + "print(train_df.info())\n", + "print(\"\\nValores nulos:\")\n", + "print(train_df.isnull().sum())\n", + "\n", + "# Ver distribuição do target\n", + "print(\"\\nDistribuição do Target:\")\n", + "print(train_df['Target'].value_counts())\n", + "print(\"\\nProporção:\")\n", + "print(train_df['Target'].value_counts(normalize=True))\n", + "\n", + "# Ver estatísticas\n", + "print(\"\\nEstatísticas descritivas:\")\n", + "print(train_df.describe())" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9iLq88-6kplI", + "outputId": "64726e78-037a-488a-fcf4-418ebfb76f75" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + 
"RangeIndex: 6736 entries, 0 to 6735\n", + "Data columns (total 10 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 ID 6736 non-null object \n", + " 1 Distributor 6736 non-null int64 \n", + " 2 Product 6736 non-null int64 \n", + " 3 Duration 6736 non-null int64 \n", + " 4 Destination 6736 non-null int64 \n", + " 5 Sales 6736 non-null float64\n", + " 6 Commission 6736 non-null float64\n", + " 7 Gender 2032 non-null float64\n", + " 8 Age 6736 non-null int64 \n", + " 9 Target 6736 non-null int64 \n", + "dtypes: float64(3), int64(6), object(1)\n", + "memory usage: 526.4+ KB\n", + "None\n", + "\n", + "Valores nulos:\n", + "ID 0\n", + "Distributor 0\n", + "Product 0\n", + "Duration 0\n", + "Destination 0\n", + "Sales 0\n", + "Commission 0\n", + "Gender 4704\n", + "Age 0\n", + "Target 0\n", + "dtype: int64\n", + "\n", + "Distribuição do Target:\n", + "Target\n", + "0 6420\n", + "1 316\n", + "Name: count, dtype: int64\n", + "\n", + "Proporção:\n", + "Target\n", + "0 0.953088\n", + "1 0.046912\n", + "Name: proportion, dtype: float64\n", + "\n", + "Estatísticas descritivas:\n", + " Distributor Product Duration Destination Sales \\\n", + "count 6736.000000 6736.00000 6736.000000 6736.000000 6736.000000 \n", + "mean 6.563539 9.40380 51.588034 81.681413 42.802316 \n", + "std 2.440587 6.62581 79.504738 39.530726 52.408053 \n", + "min 0.000000 0.00000 -1.000000 0.000000 -277.200000 \n", + "25% 6.000000 2.00000 10.000000 55.000000 18.000000 \n", + "50% 7.000000 10.00000 23.000000 86.000000 28.000000 \n", + "75% 7.000000 16.00000 54.000000 112.000000 49.500000 \n", + "max 15.000000 25.00000 444.000000 139.000000 666.000000 \n", + "\n", + " Commission Gender Age Target \n", + "count 6736.000000 2032.000000 6736.000000 6736.000000 \n", + "mean 10.469831 0.512795 39.880344 0.046912 \n", + "std 20.342999 0.499959 13.872811 0.211466 \n", + "min 0.000000 0.000000 1.000000 0.000000 \n", + "25% 0.000000 0.000000 35.000000 0.000000 \n", + 
"50% 0.000000 1.000000 36.000000 0.000000 \n", + "75% 11.880000 1.000000 44.000000 0.000000 \n", + "max 262.760000 1.000000 118.000000 1.000000 \n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Processamento" + ], + "metadata": { + "id": "iS7OoL3LHiCS" + } + }, + { + "cell_type": "code", + "source": [ + "# Função para processar os dados\n", + "def prepare_data(df, is_train=True):\n", + "    df = df.copy()\n", + "\n", + "    # Separar ID e Target\n", + "    ids = df['ID']\n", + "    if is_train:\n", + "        target = df['Target']\n", + "        df = df.drop(['ID', 'Target'], axis=1)\n", + "    else:\n", + "        df = df.drop(['ID'], axis=1)\n", + "\n", + "    # Identificar colunas categóricas e numéricas\n", + "    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()\n", + "    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()\n", + "\n", + "    print(f\"Colunas categóricas: {categorical_cols}\")\n", + "    print(f\"Colunas numéricas: {numerical_cols}\")\n", + "\n", + "    # Preencher valores nulos por atribuição (fillna com inplace=True em df[col] pode não alterar df)\n", + "    for col in numerical_cols:\n", + "        if df[col].isnull().sum() > 0:\n", + "            df[col] = df[col].fillna(df[col].median())\n", + "\n", + "    for col in categorical_cols:\n", + "        if df[col].isnull().sum() > 0:\n", + "            df[col] = df[col].fillna(df[col].mode()[0])\n", + "\n", + "    # Label Encoding para variáveis categóricas\n", + "    label_encoders = {}\n", + "    for col in categorical_cols:\n", + "        le = LabelEncoder()\n", + "        df[col] = le.fit_transform(df[col].astype(str))\n", + "        label_encoders[col] = le\n", + "\n", + "    if is_train:\n", + "        return df, target, ids\n", + "    else:\n", + "        return df, ids\n", + "\n", + "# Processar dados de treino e teste\n", + "X_train, y_train, train_ids = prepare_data(train_df, is_train=True)\n", + "X_test, test_ids = prepare_data(test_df, is_train=False)\n", + "\n", + "print(\"\\nDados processados!\")\n", + "print(\"Shape X_train:\", X_train.shape)\n", + "print(\"Shape X_test:\", X_test.shape)" + ], + "metadata": { +
"colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HOCxmQ9ZkypP", + "outputId": "7d0557f8-f8ff-42f4-9503-74ad6efbcc0d" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Colunas categóricas: []\n", + "Colunas numéricas: ['Distributor', 'Product', 'Duration', 'Destination', 'Sales', 'Commission', 'Gender', 'Age']\n", + "Colunas categóricas: []\n", + "Colunas numéricas: ['Distributor', 'Product', 'Duration', 'Destination', 'Sales', 'Commission', 'Gender', 'Age']\n", + "\n", + "Dados processados!\n", + "Shape X_train: (6736, 8)\n", + "Shape X_test: (2164, 8)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Treinamento" + ], + "metadata": { + "id": "xQ38-HuCHq_J" + } + }, + { + "cell_type": "code", + "source": [ + "# Criar e treinar o modelo Random Forest\n", + "model = RandomForestClassifier(\n", + " n_estimators=200,\n", + " max_depth=15,\n", + " min_samples_split=10,\n", + " min_samples_leaf=5,\n", + " random_state=42,\n", + " n_jobs=-1,\n", + " class_weight='balanced' # Importante para classes desbalanceadas\n", + ")\n", + "\n", + "print(\"Treinando o modelo...\")\n", + "model.fit(X_train, y_train)\n", + "print(\"Modelo treinado!\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bQBgmu4Fk4Bu", + "outputId": "d832c7cb-9a8f-4522-f262-902ce36eefcc" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Treinando o modelo...\n", + "Modelo treinado!\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Previsões/ Treino" + ], + "metadata": { + "id": "XtVRO_mwHvbB" + } + }, + { + "cell_type": "code", + "source": [ + "# Fazer previsões no treino\n", + "train_pred = model.predict(X_train)\n", + "\n", + "# Calcular F1-Score conforme especificado no enunciado\n", + "train_f1 = f1_score(y_train, train_pred, average='weighted')\n", + "train_score = 100 * train_f1 
# Multiplicar por 100 conforme enunciado\n", + "\n", + "print(f\"\\nResultados no conjunto de treino:\")\n", + "print(f\"F1-Score (weighted): {train_f1:.4f}\")\n", + "print(f\"Score final (x100): {train_score:.2f}\")\n", + "\n", + "print(\"\\n Matriz de Confusão:\")\n", + "print(confusion_matrix(y_train, train_pred))\n", + "\n", + "print(\"\\nRelatório de Classificação:\")\n", + "print(classification_report(y_train, train_pred))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "M9EzXQqck7dH", + "outputId": "236d9189-199d-4ba6-d031-2dcff7cce1df" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "Resultados no conjunto de treino:\n", + "F1-Score (weighted): 0.9454\n", + "Score final (x100): 94.54\n", + "\n", + " Matriz de Confusão:\n", + "[[6005 415]\n", + " [ 30 286]]\n", + "\n", + "Relatório de Classificação:\n", + " precision recall f1-score support\n", + "\n", + " 0 1.00 0.94 0.96 6420\n", + " 1 0.41 0.91 0.56 316\n", + "\n", + " accuracy 0.93 6736\n", + " macro avg 0.70 0.92 0.76 6736\n", + "weighted avg 0.97 0.93 0.95 6736\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Validação Cruzada" + ], + "metadata": { + "id": "kTdgT6OiH1LO" + } + }, + { + "cell_type": "code", + "source": [ + "# Validação cruzada para verificar a performance\n", + "cv_scores = cross_val_score(\n", + " model, X_train, y_train,\n", + " cv=5,\n", + " scoring='f1_weighted',\n", + " n_jobs=-1\n", + ")\n", + "\n", + "# Multiplicar por 100 para ter o score final\n", + "cv_scores_final = cv_scores * 100\n", + "\n", + "print(f\"\\n Scores da validação cruzada (5-fold):\")\n", + "for i, score in enumerate(cv_scores_final, 1):\n", + " print(f\" Fold {i}: {score:.2f}\")\n", + "\n", + "print(f\"\\nMédia: {cv_scores_final.mean():.2f}\")\n", + "print(f\"Desvio padrão: {cv_scores_final.std():.2f}\")\n", + "print(f\"Intervalo: [{cv_scores_final.mean() - 
cv_scores_final.std():.2f}, {cv_scores_final.mean() + cv_scores_final.std():.2f}]\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8HojOLyNk-vB", + "outputId": "fa9863a0-7e95-4c1e-ffdb-e9ab17fe8968" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + " Scores da validação cruzada (5-fold):\n", + " Fold 1: 91.67\n", + " Fold 2: 91.24\n", + " Fold 3: 92.27\n", + " Fold 4: 91.40\n", + " Fold 5: 91.43\n", + "\n", + "Média: 91.60\n", + "Desvio padrão: 0.36\n", + "Intervalo: [91.24, 91.96]\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Previsões (Resultado)" + ], + "metadata": { + "id": "uveoaZUlH-K9" + } + }, + { + "cell_type": "code", + "source": [ + "# Fazer previsões\n", + "test_predictions = model.predict(X_test)\n", + "\n", + "print(\"\\nDistribuição das previsões no teste:\")\n", + "unique, counts = np.unique(test_predictions, return_counts=True)\n", + "for val, count in zip(unique, counts):\n", + "    print(f\"Classe {val}: {count} ({count/len(test_predictions)*100:.1f}%)\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "L7Dwb3XulG16", + "outputId": "cd330714-196a-473a-e9bd-a359cc6c62c2" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "Distribuição das previsões no teste:\n", + "Classe 0: 1953 (90.2%)\n", + "Classe 1: 211 (9.8%)\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Criar DataFrame para salvar\n", + "submission = pd.DataFrame({\n", + "    'ID': test_ids,\n", + "    'Target': test_predictions\n", + "})\n", + "\n", + "# Salvar arquivo\n", + "submission.to_csv('resultado_av.csv', index=False)\n", + "\n", + "print(\"\\n Arquivo 'resultado_av.csv' criado com sucesso!\")\n", + "print(f\"Total de previsões: {len(submission)}\")\n", + "\n" + ], + "metadata": { + "colab": { + "base_uri":
"https://localhost:8080/" + }, + "id": "OEEgNRJxl4TG", + "outputId": "d3b5a58f-7ce8-4d4c-e153-85b13c82deae" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + " Arquivo 'resultado_av.csv' criado com sucesso!\n", + "Total de previsões: 2164\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "print(\"Primeiras linhas do Resultado\")\n", + "print(submission.head(10))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-5YW7xmoAwOc", + "outputId": "3459c416-9807-40a0-9667-496f0430095c" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Primeiras linhas do Resultado\n", + " ID Target\n", + "0 fffe31003600330038003500 0\n", + "1 fffe33003600300031003400 1\n", + "2 fffe320033003300 1\n", + "3 fffe390039003800 0\n", + "4 fffe3500350031003000 0\n", + "5 fffe31003000300037003300 0\n", + "6 fffe33003300360037003200 0\n", + "7 fffe32003500310030003200 0\n", + "8 fffe32003100320031003600 0\n", + "9 fffe3900380030003200 0\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "O modelo atingiu um F1-Score ponderado de 0.9454 no conjunto de treino e uma média de 0.9160 na validação cruzada (5-fold), o que indica boa capacidade preditiva geral.\n", + "\n", + "Ainda assim, a classe 1 (“não útil”) teve precisão de apenas 0.41 no treino, com recall de 0.91; ajustes voltados para o equilíbrio das classes podem melhorar a detecção de passageiros que consideram o kit “não útil”." + ], + "metadata": { + "id": "IYIvY29HF2AE" + } + } + ] +} \ No newline at end of file