Predicting Arrests in Chicago Crime Records

Final Project: Big Data and Cloud Computing

By: Ilana Garcia, Izabelle Silva, Julia Navarro, Lívia Bertoni

Project Context

This project works with the dataset of crimes recorded in the city of Chicago between 2001 and 2023, made available by the Chicago Police Department and organized on the Kaggle platform.
Each row represents a reported criminal incident, containing the information needed for temporal, spatial, and contextual analysis:

  • temporal variables (date, time, year);
  • crime type and detailed description;
  • characteristics of the location (residence, business, school, public street, etc.);
  • key indicators such as whether an arrest was made and whether the crime was domestic;
  • approximate location (block, police district, community area, latitude and longitude).

To protect victims' privacy, addresses are approximate: the finest level disclosed is the block, and coordinates are shifted.

Core idea: build an end-to-end Big Data pipeline with Spark, from ingesting millions of records through data preparation to running predictive models, assessing the factors associated with arrests in criminal incidents.

Dataset Overview

Below is a summary of the main variables used in the analysis:

| Variable | Description | Type | Example |
| --- | --- | --- | --- |
| ID | Unique incident identifier | Integer | 12345678 |
| Date | Date and time the crime was recorded | Date/time | 2019-07-15 23:45:00 |
| Primary Type | Main crime category | Categorical | THEFT, BATTERY, ROBBERY |
| Description | More detailed description of the crime type | Categorical | OVER $500, SIMPLE, ARMED: HANDGUN |
| Location Description | Type of location where the crime occurred | Categorical | STREET, RESIDENCE, SIDEWALK, SCHOOL |
| Arrest | Whether an arrest was made in the incident | Binary | TRUE / FALSE |
| Domestic | Whether the crime is related to domestic violence | Binary | TRUE / FALSE |
| Beat / District / Ward / Community Area | Administrative codes for police region and community area | Integer / categorical | Beat 111, District 01, Community Area 32 |
| Latitude / Longitude | Approximate coordinates of the incident's block | Numeric | 41.8781, -87.6298 |
| Year | Year the crime was recorded | Integer | 2008, 2015, 2023 |

During data preparation, new derived variables will be created (such as hour of day, day of week, period type, and region-level aggregate indicators) to enrich the predictive modeling.

Analysis Goals

The analysis aims to answer two main questions:

  1. How do crimes in Chicago behave over time, and what patterns can be observed?

    • Which years show the highest volume of incidents?
    • How do incidents vary by day of the week and time of day?
    • Are some crime categories particularly frequent?
  2. Can the probability of an arrest in an incident be predicted from its characteristics?

    • Given attributes such as crime type, time, location, and additional indicators, what is the chance that a record results in an arrest?
    • How should the strong imbalance between incidents with and without an arrest be handled?
    • Which variables appear to contribute most to the predictive model?

Spark Setup

In this section we configure and initialize a Spark session, which is required to run distributed read, transformation, and modeling operations on DataFrames.

In [5]:
from pyspark.sql import SparkSession
In [6]:
spark = SparkSession.builder \
    .appName("Crimes Data Analysis") \
    .getOrCreate()
In [7]:
spark
Out[7]:

SparkSession - in-memory

SparkContext

Spark UI

Version: v3.5.6
Master: local[*]
AppName: Crimes Data Analysis

Importing the Dataset (KaggleHub)

To start the project, we download the Crimes In Chicago (2001 to 2023) dataset directly from Kaggle.
We use the kagglehub library, which automatically downloads the latest version of the dataset, ensuring convenience and reproducibility.

The code below:

  • imports the kagglehub package;
  • downloads the requested dataset;
  • stores the local path where the files were saved;
  • prints that path for use in the next steps of the notebook.

Run the code cell below:

In [8]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("utkarshx27/crimes-2001-to-present")

print("Path to dataset files:", path)
Path to dataset files: /tmp/pads/.cache/kagglehub/datasets/utkarshx27/crimes-2001-to-present/versions/1
In [9]:
from pyspark.sql.types import *

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Case Number", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Block", StringType(), True),
    StructField("IUCR", StringType(), True),
    StructField("Primary Type", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("Location Description", StringType(), True),
    StructField("Arrest", BooleanType(), True),
    StructField("Domestic", BooleanType(), True),
    StructField("Beat", IntegerType(), True),
    StructField("District", IntegerType(), True),
    StructField("Ward", IntegerType(), True),
    StructField("Community Area", IntegerType(), True),
    StructField("FBI Code", StringType(), True),
    StructField("X Coordinate", FloatType(), True),
    StructField("Y Coordinate", FloatType(), True),
    StructField("Year", IntegerType(), True),
    StructField("Updated On", StringType(), True),
    StructField("Latitude", FloatType(), True),
    StructField("Longitude", FloatType(), True),
    StructField("Location", StringType(), True)
])
In [10]:
crimes_df = spark.read.csv(path, header=True, schema=schema)
In [11]:
crimes_df.show()
                                                                                
+--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+
|      ID|Case Number|                Date|               Block|IUCR|      Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On| Latitude| Longitude|            Location|
+--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+
|10224738|   HY411648|09/05/2015 01:30:...|     043XX S WOOD ST|0486|           BATTERY|DOMESTIC BATTERY ...|           RESIDENCE| false|    true| 924|       9|  12|            61|     08B|   1165074.0|   1875917.0|2015|02/10/2018 03:50:...|41.815117|    -87.67|(41.815117282, -8...|
|10224739|   HY411615|09/04/2015 11:30:...| 008XX N CENTRAL AVE|0870|             THEFT|      POCKET-PICKING|             CTA BUS| false|   false|1511|      15|  29|            25|      06|   1138875.0|   1904869.0|2015|02/10/2018 03:50:...| 41.89508|  -87.7654|(41.895080471, -8...|
|11646166|   JC213529|09/01/2018 12:01:...|082XX S INGLESIDE...|0810|             THEFT|           OVER $500|           RESIDENCE| false|    true| 631|       6|   8|            44|      06|        NULL|        NULL|2018|04/06/2019 04:04:...|     NULL|      NULL|                NULL|
|10224740|   HY411595|09/05/2015 12:45:...|   035XX W BARRY AVE|2023|         NARCOTICS|POSS: HEROIN(BRN/...|            SIDEWALK|  true|   false|1412|      14|  35|            21|      18|   1152037.0|   1920384.0|2015|02/10/2018 03:50:...|41.937405| -87.71665|(41.937405765, -8...|
|10224741|   HY411610|09/05/2015 01:00:...| 0000X N LARAMIE AVE|0560|           ASSAULT|              SIMPLE|           APARTMENT| false|    true|1522|      15|  28|            25|     08A|   1141706.0|   1900086.0|2015|02/10/2018 03:50:...|41.881905| -87.75512|(41.881903443, -8...|
|10224742|   HY411435|09/05/2015 10:55:...| 082XX S LOOMIS BLVD|0610|          BURGLARY|      FORCIBLE ENTRY|           RESIDENCE| false|   false| 614|       6|  21|            71|      05|   1168430.0|   1850165.0|2015|02/10/2018 03:50:...|41.744377| -87.65843|(41.744378879, -8...|
|10224743|   HY411629|09/04/2015 06:00:...|021XX W CHURCHILL ST|0620|          BURGLARY|      UNLAWFUL ENTRY|    RESIDENCE-GARAGE| false|   false|1434|      14|  32|            24|      05|   1161628.0|   1912157.0|2015|02/10/2018 03:50:...|41.914635| -87.68163|(41.914635603, -8...|
|10224744|   HY411605|09/05/2015 01:00:...|   025XX W CERMAK RD|0860|             THEFT|        RETAIL THEFT|  GROCERY FOOD STORE|  true|   false|1034|      10|  25|            31|      06|   1159734.0|   1889313.0|2015|09/17/2015 11:37:...| 41.85199| -87.68922|(41.851988885, -8...|
|10224745|   HY411654|09/05/2015 11:30:...|031XX W WASHINGTO...|0320|           ROBBERY|STRONGARM - NO WE...|              STREET| false|    true|1222|      12|  27|            27|      03|   1155536.0|   1900515.0|2015|02/10/2018 03:50:...|41.882812| -87.70432|(41.88281374, -87...|
|11645836|   JC212333|05/01/2016 12:25:...| 055XX S ROCKWELL ST|1153|DECEPTIVE PRACTICE|FINANCIAL IDENTIT...|                NULL| false|   false| 824|       8|  15|            63|      11|        NULL|        NULL|2016|04/06/2019 04:04:...|     NULL|      NULL|                NULL|
|10224746|   HY411662|09/05/2015 02:00:...|  071XX S PULASKI RD|0820|             THEFT|      $500 AND UNDER|PARKING LOT/GARAG...| false|   false| 833|       8|  13|            65|      06|   1150938.0|   1857056.0|2015|02/10/2018 03:50:...| 41.76365| -87.72234|(41.763647552, -8...|
|10224749|   HY411626|09/05/2015 11:00:...|052XX N MILWAUKEE...|0460|           BATTERY|              SIMPLE|  SMALL RETAIL STORE| false|   false|1623|      16|  45|            11|     08B|   1137969.0|   1934340.0|2015|02/10/2018 03:50:...|41.975967| -87.76801|(41.975968415, -8...|
|10224750|   HY411632|09/05/2015 03:00:...|    0000X W 103RD ST|2820|     OTHER OFFENSE|    TELEPHONE THREAT|           APARTMENT| false|    true| 512|       5|  34|            49|      26|   1177871.0|   1836676.0|2015|02/10/2018 03:50:...|41.707153|-87.624245|(41.707154919, -8...|
|10224751|   HY411566|09/05/2015 12:50:...|     013XX E 47TH ST|0486|           BATTERY|DOMESTIC BATTERY ...|              STREET| false|    true| 222|       2|   4|            39|     08B|   1185907.0|   1874105.0|2015|02/10/2018 03:50:...|41.809677|-87.593636|(41.809678314, -8...|
|10224752|   HY411601|09/03/2015 01:00:...| 020XX W SCHILLER ST|0810|             THEFT|           OVER $500|              STREET| false|   false|1424|      14|   1|            24|      06|   1162574.0|   1909428.0|2015|02/10/2018 03:50:...|41.907127| -87.67823|(41.907127255, -8...|
|10224753|   HY411489|09/05/2015 11:45:...|  080XX S JUSTINE ST|0497|           BATTERY|AGGRAVATED DOMEST...|           APARTMENT| false|   false| 612|       6|  21|            71|     04B|   1167400.0|   1851512.0|2015|02/10/2018 03:50:...|41.748096| -87.66216|(41.748097343, -8...|
|10224754|   HY411656|09/05/2015 01:30:...|007XX N LEAMINGTO...|1320|   CRIMINAL DAMAGE|          TO VEHICLE|              STREET| false|   false|1531|      15|  28|            25|      14|   1141889.0|   1904448.0|2015|02/10/2018 03:50:...| 41.89387| -87.75434|(41.893869916, -8...|
|10224756|   HY410094|07/08/2015 12:00:...|103XX S TORRENCE AVE|0620|          BURGLARY|      UNLAWFUL ENTRY|               OTHER| false|   false| 434|       4|  10|            51|      05|   1195508.0|   1836950.0|2015|02/10/2018 03:50:...| 41.70749| -87.55965|(41.707490122, -8...|
|10224757|   HY411388|09/05/2015 09:55:...|  088XX S PAULINA ST|0610|          BURGLARY|      FORCIBLE ENTRY|           RESIDENCE|  true|   false|2221|      22|  21|            71|      05|   1166554.0|   1846067.0|2015|02/10/2018 03:50:...|41.733173| -87.66542|(41.733173536, -8...|
|10224758|   HY411568|09/05/2015 12:35:...|    059XX W GRACE ST|0486|           BATTERY|DOMESTIC BATTERY ...|              STREET| false|    true|1633|      16|  38|            15|     08B|   1136014.0|   1924656.0|2015|02/10/2018 03:50:...| 41.94943| -87.77544|(41.949429769, -8...|
+--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+
only showing top 20 rows

In [12]:
import pyspark.sql.functions as sf

crimes_df = (
    crimes_df.withColumn('Date', sf.to_timestamp('Date', 'MM/dd/yyyy hh:mm:ss a'))
      .withColumn('Updated On', sf.to_timestamp('Updated On', 'MM/dd/yyyy hh:mm:ss a'))
)
 
In [13]:
crimes_df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: float (nullable = true)
 |-- Y Coordinate: float (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: timestamp (nullable = true)
 |-- Latitude: float (nullable = true)
 |-- Longitude: float (nullable = true)
 |-- Location: string (nullable = true)

In [14]:
crimes_df.show()
+--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|      Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|         Updated On| Latitude| Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+
|10224738|   HY411648|2015-09-05 13:30:00|     043XX S WOOD ST|0486|           BATTERY|DOMESTIC BATTERY ...|           RESIDENCE| false|    true| 924|       9|  12|            61|     08B|   1165074.0|   1875917.0|2015|2018-02-10 15:50:01|41.815117|    -87.67|(41.815117282, -8...|
|10224739|   HY411615|2015-09-04 11:30:00| 008XX N CENTRAL AVE|0870|             THEFT|      POCKET-PICKING|             CTA BUS| false|   false|1511|      15|  29|            25|      06|   1138875.0|   1904869.0|2015|2018-02-10 15:50:01| 41.89508|  -87.7654|(41.895080471, -8...|
|11646166|   JC213529|2018-09-01 00:01:00|082XX S INGLESIDE...|0810|             THEFT|           OVER $500|           RESIDENCE| false|    true| 631|       6|   8|            44|      06|        NULL|        NULL|2018|2019-04-06 16:04:43|     NULL|      NULL|                NULL|
|10224740|   HY411595|2015-09-05 12:45:00|   035XX W BARRY AVE|2023|         NARCOTICS|POSS: HEROIN(BRN/...|            SIDEWALK|  true|   false|1412|      14|  35|            21|      18|   1152037.0|   1920384.0|2015|2018-02-10 15:50:01|41.937405| -87.71665|(41.937405765, -8...|
|10224741|   HY411610|2015-09-05 13:00:00| 0000X N LARAMIE AVE|0560|           ASSAULT|              SIMPLE|           APARTMENT| false|    true|1522|      15|  28|            25|     08A|   1141706.0|   1900086.0|2015|2018-02-10 15:50:01|41.881905| -87.75512|(41.881903443, -8...|
|10224742|   HY411435|2015-09-05 10:55:00| 082XX S LOOMIS BLVD|0610|          BURGLARY|      FORCIBLE ENTRY|           RESIDENCE| false|   false| 614|       6|  21|            71|      05|   1168430.0|   1850165.0|2015|2018-02-10 15:50:01|41.744377| -87.65843|(41.744378879, -8...|
|10224743|   HY411629|2015-09-04 18:00:00|021XX W CHURCHILL ST|0620|          BURGLARY|      UNLAWFUL ENTRY|    RESIDENCE-GARAGE| false|   false|1434|      14|  32|            24|      05|   1161628.0|   1912157.0|2015|2018-02-10 15:50:01|41.914635| -87.68163|(41.914635603, -8...|
|10224744|   HY411605|2015-09-05 13:00:00|   025XX W CERMAK RD|0860|             THEFT|        RETAIL THEFT|  GROCERY FOOD STORE|  true|   false|1034|      10|  25|            31|      06|   1159734.0|   1889313.0|2015|2015-09-17 11:37:18| 41.85199| -87.68922|(41.851988885, -8...|
|10224745|   HY411654|2015-09-05 11:30:00|031XX W WASHINGTO...|0320|           ROBBERY|STRONGARM - NO WE...|              STREET| false|    true|1222|      12|  27|            27|      03|   1155536.0|   1900515.0|2015|2018-02-10 15:50:01|41.882812| -87.70432|(41.88281374, -87...|
|11645836|   JC212333|2016-05-01 00:25:00| 055XX S ROCKWELL ST|1153|DECEPTIVE PRACTICE|FINANCIAL IDENTIT...|                NULL| false|   false| 824|       8|  15|            63|      11|        NULL|        NULL|2016|2019-04-06 16:04:43|     NULL|      NULL|                NULL|
|10224746|   HY411662|2015-09-05 14:00:00|  071XX S PULASKI RD|0820|             THEFT|      $500 AND UNDER|PARKING LOT/GARAG...| false|   false| 833|       8|  13|            65|      06|   1150938.0|   1857056.0|2015|2018-02-10 15:50:01| 41.76365| -87.72234|(41.763647552, -8...|
|10224749|   HY411626|2015-09-05 11:00:00|052XX N MILWAUKEE...|0460|           BATTERY|              SIMPLE|  SMALL RETAIL STORE| false|   false|1623|      16|  45|            11|     08B|   1137969.0|   1934340.0|2015|2018-02-10 15:50:01|41.975967| -87.76801|(41.975968415, -8...|
|10224750|   HY411632|2015-09-05 03:00:00|    0000X W 103RD ST|2820|     OTHER OFFENSE|    TELEPHONE THREAT|           APARTMENT| false|    true| 512|       5|  34|            49|      26|   1177871.0|   1836676.0|2015|2018-02-10 15:50:01|41.707153|-87.624245|(41.707154919, -8...|
|10224751|   HY411566|2015-09-05 12:50:00|     013XX E 47TH ST|0486|           BATTERY|DOMESTIC BATTERY ...|              STREET| false|    true| 222|       2|   4|            39|     08B|   1185907.0|   1874105.0|2015|2018-02-10 15:50:01|41.809677|-87.593636|(41.809678314, -8...|
|10224752|   HY411601|2015-09-03 13:00:00| 020XX W SCHILLER ST|0810|             THEFT|           OVER $500|              STREET| false|   false|1424|      14|   1|            24|      06|   1162574.0|   1909428.0|2015|2018-02-10 15:50:01|41.907127| -87.67823|(41.907127255, -8...|
|10224753|   HY411489|2015-09-05 11:45:00|  080XX S JUSTINE ST|0497|           BATTERY|AGGRAVATED DOMEST...|           APARTMENT| false|   false| 612|       6|  21|            71|     04B|   1167400.0|   1851512.0|2015|2018-02-10 15:50:01|41.748096| -87.66216|(41.748097343, -8...|
|10224754|   HY411656|2015-09-05 13:30:00|007XX N LEAMINGTO...|1320|   CRIMINAL DAMAGE|          TO VEHICLE|              STREET| false|   false|1531|      15|  28|            25|      14|   1141889.0|   1904448.0|2015|2018-02-10 15:50:01| 41.89387| -87.75434|(41.893869916, -8...|
|10224756|   HY410094|2015-07-08 00:00:00|103XX S TORRENCE AVE|0620|          BURGLARY|      UNLAWFUL ENTRY|               OTHER| false|   false| 434|       4|  10|            51|      05|   1195508.0|   1836950.0|2015|2018-02-10 15:50:01| 41.70749| -87.55965|(41.707490122, -8...|
|10224757|   HY411388|2015-09-05 09:55:00|  088XX S PAULINA ST|0610|          BURGLARY|      FORCIBLE ENTRY|           RESIDENCE|  true|   false|2221|      22|  21|            71|      05|   1166554.0|   1846067.0|2015|2018-02-10 15:50:01|41.733173| -87.66542|(41.733173536, -8...|
|10224758|   HY411568|2015-09-05 12:35:00|    059XX W GRACE ST|0486|           BATTERY|DOMESTIC BATTERY ...|              STREET| false|    true|1633|      16|  38|            15|     08B|   1136014.0|   1924656.0|2015|2018-02-10 15:50:01| 41.94943| -87.77544|(41.949429769, -8...|
+--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+
only showing top 20 rows

In [15]:
from pyspark.sql import functions as F

crimes_df = crimes_df.select(
    F.col("ID").alias("id"),
    F.col("Case Number").alias("case_number"),
    F.col("Date").alias("date"),
    F.col("Block").alias("block"),
    F.col("IUCR").alias("iucr"),
    F.col("Primary Type").alias("primary_type"),
    F.col("Description").alias("description"),
    F.col("Location Description").alias("location_description"),
    F.col("Arrest").alias("arrest"),
    F.col("Domestic").alias("domestic"),
    F.col("Beat").alias("beat"),
    F.col("District").alias("district"),
    F.col("Ward").alias("ward"),
    F.col("Community Area").alias("community_area"),
    F.col("FBI Code").alias("fbi_code"),
    F.col("X Coordinate").alias("x_coordinate"),
    F.col("Y Coordinate").alias("y_coordinate"),
    F.col("Year").alias("year"),
    F.col("Updated On").alias("updated_on"),
    F.col("Latitude").alias("latitude"),
    F.col("Longitude").alias("longitude"),
    F.col("Location").alias("location"))
In [16]:
crimes_df.limit(10).toPandas()
Out[16]:
id case_number date block iucr primary_type description location_description arrest domestic ... ward community_area fbi_code x_coordinate y_coordinate year updated_on latitude longitude location
0 10224738 HY411648 2015-09-05 13:30:00 043XX S WOOD ST 0486 BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE False True ... 12 61 08B 1165074.0 1875917.0 2015 2018-02-10 15:50:01 41.815117 -87.669998 (41.815117282, -87.669999562)
1 10224739 HY411615 2015-09-04 11:30:00 008XX N CENTRAL AVE 0870 THEFT POCKET-PICKING CTA BUS False False ... 29 25 06 1138875.0 1904869.0 2015 2018-02-10 15:50:01 41.895081 -87.765404 (41.895080471, -87.765400451)
2 11646166 JC213529 2018-09-01 00:01:00 082XX S INGLESIDE AVE 0810 THEFT OVER $500 RESIDENCE False True ... 8 44 06 NaN NaN 2018 2019-04-06 16:04:43 NaN NaN None
3 10224740 HY411595 2015-09-05 12:45:00 035XX W BARRY AVE 2023 NARCOTICS POSS: HEROIN(BRN/TAN) SIDEWALK True False ... 35 21 18 1152037.0 1920384.0 2015 2018-02-10 15:50:01 41.937405 -87.716652 (41.937405765, -87.716649687)
4 10224741 HY411610 2015-09-05 13:00:00 0000X N LARAMIE AVE 0560 ASSAULT SIMPLE APARTMENT False True ... 28 25 08A 1141706.0 1900086.0 2015 2018-02-10 15:50:01 41.881905 -87.755119 (41.881903443, -87.755121152)
5 10224742 HY411435 2015-09-05 10:55:00 082XX S LOOMIS BLVD 0610 BURGLARY FORCIBLE ENTRY RESIDENCE False False ... 21 71 05 1168430.0 1850165.0 2015 2018-02-10 15:50:01 41.744377 -87.658432 (41.744378879, -87.658430635)
6 10224743 HY411629 2015-09-04 18:00:00 021XX W CHURCHILL ST 0620 BURGLARY UNLAWFUL ENTRY RESIDENCE-GARAGE False False ... 32 24 05 1161628.0 1912157.0 2015 2018-02-10 15:50:01 41.914635 -87.681633 (41.914635603, -87.681630909)
7 10224744 HY411605 2015-09-05 13:00:00 025XX W CERMAK RD 0860 THEFT RETAIL THEFT GROCERY FOOD STORE True False ... 25 31 06 1159734.0 1889313.0 2015 2015-09-17 11:37:18 41.851990 -87.689217 (41.851988885, -87.689219118)
8 10224745 HY411654 2015-09-05 11:30:00 031XX W WASHINGTON BLVD 0320 ROBBERY STRONGARM - NO WEAPON STREET False True ... 27 27 03 1155536.0 1900515.0 2015 2018-02-10 15:50:01 41.882812 -87.704323 (41.88281374, -87.704325717)
9 11645836 JC212333 2016-05-01 00:25:00 055XX S ROCKWELL ST 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300 None False False ... 15 63 11 NaN NaN 2016 2019-04-06 16:04:43 NaN NaN None

10 rows × 22 columns

In [152]:
num_rows = crimes_df.count()
num_cols = len(crimes_df.columns)

print(num_rows, num_cols)
7784664 22
                                                                                
In [153]:
crimes_df_s = crimes_df.sample(fraction=0.1, seed=42)
In [19]:
crimes_df.describe()
Out[19]:
DataFrame[summary: string, id: string, case_number: string, block: string, iucr: string, primary_type: string, description: string, location_description: string, beat: string, district: string, ward: string, community_area: string, fbi_code: string, x_coordinate: string, y_coordinate: string, year: string, latitude: string, longitude: string, location: string]

Analyzing Null Values

In [20]:
print("Rows:", crimes_df.count())
print("Columns:", len(crimes_df.columns))
Rows: 7784664
Columns: 22
                                                                                
In [21]:
crimes_df.show(5, truncate=False)
+--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+
|id      |case_number|date               |block                |iucr|primary_type|description            |location_description|arrest|domestic|beat|district|ward|community_area|fbi_code|x_coordinate|y_coordinate|year|updated_on         |latitude |longitude|location                     |
+--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+
|10224738|HY411648   |2015-09-05 13:30:00|043XX S WOOD ST      |0486|BATTERY     |DOMESTIC BATTERY SIMPLE|RESIDENCE           |false |true    |924 |9       |12  |61            |08B     |1165074.0   |1875917.0   |2015|2018-02-10 15:50:01|41.815117|-87.67   |(41.815117282, -87.669999562)|
|10224739|HY411615   |2015-09-04 11:30:00|008XX N CENTRAL AVE  |0870|THEFT       |POCKET-PICKING         |CTA BUS             |false |false   |1511|15      |29  |25            |06      |1138875.0   |1904869.0   |2015|2018-02-10 15:50:01|41.89508 |-87.7654 |(41.895080471, -87.765400451)|
|11646166|JC213529   |2018-09-01 00:01:00|082XX S INGLESIDE AVE|0810|THEFT       |OVER $500              |RESIDENCE           |false |true    |631 |6       |8   |44            |06      |NULL        |NULL        |2018|2019-04-06 16:04:43|NULL     |NULL     |NULL                         |
|10224740|HY411595   |2015-09-05 12:45:00|035XX W BARRY AVE    |2023|NARCOTICS   |POSS: HEROIN(BRN/TAN)  |SIDEWALK            |true  |false   |1412|14      |35  |21            |18      |1152037.0   |1920384.0   |2015|2018-02-10 15:50:01|41.937405|-87.71665|(41.937405765, -87.716649687)|
|10224741|HY411610   |2015-09-05 13:00:00|0000X N LARAMIE AVE  |0560|ASSAULT     |SIMPLE                 |APARTMENT           |false |true    |1522|15      |28  |25            |08A     |1141706.0   |1900086.0   |2015|2018-02-10 15:50:01|41.881905|-87.75512|(41.881903443, -87.755121152)|
+--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+
only showing top 5 rows

In [22]:
# note: importing sum here shadows Python's built-in sum in the notebook scope
from pyspark.sql.functions import col, sum

total = crimes_df.count()

# percentage of null values per column
nulos = crimes_df.select([
    (sum(col(c).isNull().cast("int")) / total * 100).alias(c)
    for c in crimes_df.columns
])

nulos_pd = nulos.toPandas()

nulos_pd
                                                                                
Out[22]:
id case_number date block iucr primary_type description location_description arrest domestic ... ward community_area fbi_code x_coordinate y_coordinate year updated_on latitude longitude location
0 0.0 0.000051 0.0 0.0 0.0 0.0 0.0 0.133352 0.0 0.0 ... 7.898196 7.880571 0.0 1.115629 1.115629 0.0 0.0 1.115629 1.115629 1.115629

1 rows × 22 columns

In [23]:
import matplotlib.pyplot as plt

# Convert Spark -> Pandas
nulos_pd = nulos.toPandas().T.reset_index()
nulos_pd.columns = ["variavel", "percent_nulls"]

plt.figure(figsize=(12, 6))
plt.bar(nulos_pd["variavel"], nulos_pd["percent_nulls"])
plt.xticks(rotation=90)
plt.ylabel("Percentage of null values (%)")
plt.title("Percentage of null values per variable")
plt.tight_layout()
plt.show()
[Figure: bar chart of the percentage of null values per variable]
In [24]:
crimes_df2 = crimes_df.na.drop()
In [25]:
print("Rows before cleaning:", crimes_df.count())
print("Rows after dropping nulls:", crimes_df2.count())
Rows before cleaning: 7784664
Rows after dropping nulls: 7084435
                                                                                
In [26]:
from pyspark.sql.functions import approx_count_distinct as acd

card_df2 = crimes_df2.select([
    acd(c).alias(f"{c}_approx_unique")
    for c in crimes_df2.columns
])

card_pd = card_df2.toPandas().T
card_pd.columns = ["approx_unique"]
card_pd
Out[26]:
approx_unique
id_approx_unique 7010933
case_number_approx_unique 6892944
date_approx_unique 2965609
block_approx_unique 35056
iucr_approx_unique 386
primary_type_approx_unique 36
description_approx_unique 516
location_description_approx_unique 226
arrest_approx_unique 2
domestic_approx_unique 2
beat_approx_unique 328
district_approx_unique 25
ward_approx_unique 49
community_area_approx_unique 79
fbi_code_approx_unique 27
x_coordinate_approx_unique 73738
y_coordinate_approx_unique 119819
year_approx_unique 24
updated_on_approx_unique 4664
latitude_approx_unique 84442
longitude_approx_unique 39758
location_approx_unique 677206
In [27]:
crimes_df2.groupBy("primary_type") \
         .count() \
         .orderBy("count", ascending=False) \
         .show(20, truncate=False)
+--------------------------------+-------+
|primary_type                    |count  |
+--------------------------------+-------+
|THEFT                           |1499197|
|BATTERY                         |1299859|
|CRIMINAL DAMAGE                 |811905 |
|NARCOTICS                       |669097 |
|ASSAULT                         |465810 |
|OTHER OFFENSE                   |440288 |
|BURGLARY                        |390418 |
|MOTOR VEHICLE THEFT             |339630 |
|DECEPTIVE PRACTICE              |302833 |
|ROBBERY                         |267994 |
|CRIMINAL TRESPASS               |195986 |
|WEAPONS VIOLATION               |100385 |
|PROSTITUTION                    |61348  |
|OFFENSE INVOLVING CHILDREN      |49456  |
|PUBLIC PEACE VIOLATION          |48705  |
|SEX OFFENSE                     |26311  |
|CRIM SEXUAL ASSAULT             |24123  |
|INTERFERENCE WITH PUBLIC OFFICER|17821  |
|GAMBLING                        |13405  |
|LIQUOR LAW VIOLATION            |12782  |
+--------------------------------+-------+
only showing top 20 rows

                                                                                
In [28]:
import matplotlib.pyplot as plt

# Agrupamento no Spark
tipos_crimes_df = (
    crimes_df2.groupBy("primary_type")
             .count()
             .orderBy("count", ascending=False)
)

# Coletando para Pandas (seguro porque são poucas categorias)
tipos_crimes_pd = tipos_crimes_df.toPandas()

# Criar gráfico horizontal
plt.figure(figsize=(10, 8))
plt.barh(
    tipos_crimes_pd["primary_type"],
    tipos_crimes_pd["count"]
)
plt.xlabel("Quantidade de crimes")
plt.ylabel("Tipo de crime")
plt.title("Crimes por Tipo")
plt.gca().invert_yaxis()  # maior no topo
plt.tight_layout()
plt.show()
[Imagem: gráfico de barras "Crimes por Tipo"]
In [29]:
desc_type_df = crimes_df2.groupBy("primary_type", "description") \
                        .count() \
                        .orderBy("count", ascending=False)
desc_type_pd = desc_type_df.toPandas()
top20 = desc_type_pd.head(20)
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(top20["primary_type"] + " - " + top20["description"], top20["count"])
plt.xlabel("Quantidade de Crimes")
plt.ylabel("Tipo - Descrição")
plt.title("Top 20 Ocorrências por Tipo e Descrição de Crime")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
[Imagem: gráfico de barras "Top 20 Ocorrências por Tipo e Descrição de Crime"]
In [30]:
crimes_df2.groupBy("community_area") \
         .count() \
         .orderBy("count", ascending=False) \
         .show(15)
+--------------+------+
|community_area| count|
+--------------+------+
|            25|443654|
|             8|249001|
|            43|234412|
|            23|221397|
|            28|213872|
|            24|207273|
|            29|206533|
|            67|203194|
|            71|201132|
|            49|188672|
|            68|185313|
|            69|176737|
|            32|175126|
|            66|173057|
|            44|156585|
+--------------+------+
only showing top 15 rows

In [31]:
crimes_df2.groupBy("district") \
         .count() \
         .orderBy("count", ascending=False) \
         .show(15)
+--------+------+
|district| count|
+--------+------+
|       8|479088|
|      11|457146|
|       6|418970|
|       7|412732|
|       4|406393|
|      25|402994|
|       3|360823|
|      12|348935|
|       9|345915|
|       2|321434|
|       5|316018|
|      18|316000|
|      19|315365|
|      10|307428|
|      15|304731|
+--------+------+
only showing top 15 rows

In [32]:
crimes_por_ano = crimes_df2.groupBy("year") \
                          .count() \
                          .orderBy("year")

pdf = crimes_por_ano.toPandas()

import seaborn as sns
import matplotlib.pyplot as plt

# Tema elegante
sns.set_theme(style="whitegrid")

# Criar figura
plt.figure(figsize=(14, 6))

# Gráfico
ax = sns.lineplot(
    data=pdf,
    x="year",
    y="count",
    marker="o",
    linewidth=2.2,
    markersize=8
)

# Mostrar todos os anos no eixo X
ax.set_xticks(pdf["year"])

# Rotacionar para não sobrepor
plt.xticks(rotation=45)

# Títulos e estética
plt.title("Crimes por Ano", fontsize=18, weight='bold')
plt.xlabel("Ano", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)

# Ajusta layout pra não cortar nada
plt.tight_layout()

plt.show()
[Imagem: gráfico de linha "Crimes por Ano"]
In [34]:
from pyspark.sql.functions import hour, month, to_timestamp

crimes_df2 = crimes_df2.withColumn(
    "date_ts",
    to_timestamp("date", "MM/dd/yyyy hh:mm:ss a")
)

crimes_df2 = crimes_df2.withColumn("month", month("date_ts"))

crimes_hora_ano = crimes_df2.groupBy("year", "month") \
                           .count() \
                           .orderBy("year", "month")

pdf3 = crimes_hora_ano.toPandas()

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

plt.figure(figsize=(16, 8))

ax = sns.lineplot(
    data=pdf3,
    x="month",
    y="count",
    hue="year",
    palette="tab20",
    marker="o"
)

# Mostrar todos os meses 1–12
ax.set_xticks(sorted(pdf3["month"].unique()))

plt.xticks(rotation=0)
plt.title("Crimes por Mês do Ano (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Mês", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)

plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()

plt.show()
[Imagem: gráfico de linhas "Crimes por Mês do Ano (Separado por Ano)"]
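O padrão de data passado ao `to_timestamp` ("MM/dd/yyyy hh:mm:ss a") corresponde, em Python puro, ao formato `"%m/%d/%Y %I:%M:%S %p"` do `datetime.strptime` — um esboço rápido para conferir o parsing do horário em formato 12h com AM/PM:

```python
from datetime import datetime

# equivalente em Python do padrão Spark "MM/dd/yyyy hh:mm:ss a"
fmt = "%m/%d/%Y %I:%M:%S %p"

ts = datetime.strptime("09/05/2015 01:30:00 PM", fmt)
print(ts.hour, ts.month)  # 13 9
```

Note que "hh" + "a" (12h) vira hora 13 após o parsing, que é o valor extraído depois por `hour("date_ts")`.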
In [33]:
crimes_tipo_ano = crimes_df2.groupBy("year", "primary_type") \
                           .count() \
                           .orderBy("year", "primary_type")

pdf2 = crimes_tipo_ano.toPandas()

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

plt.figure(figsize=(16, 8))

ax = sns.lineplot(
    data=pdf2,
    x="year",
    y="count",
    hue="primary_type",      # ➜ uma linha por tipo!
    estimator=None,
    marker="o"
)

# Mostrar todos os anos no eixo X
ax.set_xticks(sorted(pdf2["year"].unique()))

plt.xticks(rotation=45)
plt.title("Crimes por Ano e Tipo de Crime", fontsize=18, weight="bold")
plt.xlabel("Ano", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)

plt.legend(title="Tipo de Crime", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()

plt.show()
[Imagem: gráfico de linhas "Crimes por Ano e Tipo de Crime"]
In [35]:
# removendo os dados de 2001, 2002 e 2023
crimes_df2 = crimes_df.filter(
    (col("year") != 2001) &
    (col("year") != 2002) &
    (col("year") != 2023)
)
# removemos 2001 e 2002 porque concentravam os registros com valores nulos;
# 2023 tinha dados apenas até abril
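A mesma exclusão poderia ser escrita de forma mais compacta com `~col("year").isin(2001, 2002, 2023)`. A lógica do filtro, isolada em Python puro, é simplesmente uma checagem de pertencimento:

```python
# anos a excluir (2001/2002 concentram nulos; 2023 está incompleto)
anos_excluidos = {2001, 2002, 2023}

anos = [2001, 2005, 2010, 2022, 2023]
mantidos = [a for a in anos if a not in anos_excluidos]
print(mantidos)  # [2005, 2010, 2022]
```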
In [36]:
from pyspark.sql.functions import hour, to_timestamp

crimes_df2 = crimes_df2.withColumn(
    "date_ts",
    to_timestamp("date", "MM/dd/yyyy hh:mm:ss a")
)

crimes_df2 = crimes_df2.withColumn("hour", hour("date_ts"))

crimes_hora_ano = crimes_df2.groupBy("year", "hour") \
                           .count() \
                           .orderBy("year", "hour")

pdf3 = crimes_hora_ano.toPandas()

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

plt.figure(figsize=(16, 8))

ax = sns.lineplot(
    data=pdf3,
    x="hour",
    y="count",
    hue="year",
    palette="tab20",
    marker="o"
)

# Mostrar todas as horas 0–23
ax.set_xticks(sorted(pdf3["hour"].unique()))

plt.xticks(rotation=0)
plt.title("Crimes por Hora do Dia (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Hora do Dia", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)

plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()

plt.show()
[Imagem: gráfico de linhas "Crimes por Hora do Dia (Separado por Ano)"]
In [37]:
crimes_df2.limit(10).toPandas().T
Out[37]:
0 1 2 3 4 5 6 7 8 9
id 10224738 10224739 11646166 10224740 10224741 10224742 10224743 10224744 10224745 11645836
case_number HY411648 HY411615 JC213529 HY411595 HY411610 HY411435 HY411629 HY411605 HY411654 JC212333
date 2015-09-05 13:30:00 2015-09-04 11:30:00 2018-09-01 00:01:00 2015-09-05 12:45:00 2015-09-05 13:00:00 2015-09-05 10:55:00 2015-09-04 18:00:00 2015-09-05 13:00:00 2015-09-05 11:30:00 2016-05-01 00:25:00
block 043XX S WOOD ST 008XX N CENTRAL AVE 082XX S INGLESIDE AVE 035XX W BARRY AVE 0000X N LARAMIE AVE 082XX S LOOMIS BLVD 021XX W CHURCHILL ST 025XX W CERMAK RD 031XX W WASHINGTON BLVD 055XX S ROCKWELL ST
iucr 0486 0870 0810 2023 0560 0610 0620 0860 0320 1153
primary_type BATTERY THEFT THEFT NARCOTICS ASSAULT BURGLARY BURGLARY THEFT ROBBERY DECEPTIVE PRACTICE
description DOMESTIC BATTERY SIMPLE POCKET-PICKING OVER $500 POSS: HEROIN(BRN/TAN) SIMPLE FORCIBLE ENTRY UNLAWFUL ENTRY RETAIL THEFT STRONGARM - NO WEAPON FINANCIAL IDENTITY THEFT OVER $ 300
location_description RESIDENCE CTA BUS RESIDENCE SIDEWALK APARTMENT RESIDENCE RESIDENCE-GARAGE GROCERY FOOD STORE STREET None
arrest False False False True False False False True False False
domestic True False True False True False False False True False
beat 924 1511 631 1412 1522 614 1434 1034 1222 824
district 9 15 6 14 15 6 14 10 12 8
ward 12 29 8 35 28 21 32 25 27 15
community_area 61 25 44 21 25 71 24 31 27 63
fbi_code 08B 06 06 18 08A 05 05 06 03 11
x_coordinate 1165074.0 1138875.0 NaN 1152037.0 1141706.0 1168430.0 1161628.0 1159734.0 1155536.0 NaN
y_coordinate 1875917.0 1904869.0 NaN 1920384.0 1900086.0 1850165.0 1912157.0 1889313.0 1900515.0 NaN
year 2015 2015 2018 2015 2015 2015 2015 2015 2015 2016
updated_on 2018-02-10 15:50:01 2018-02-10 15:50:01 2019-04-06 16:04:43 2018-02-10 15:50:01 2018-02-10 15:50:01 2018-02-10 15:50:01 2018-02-10 15:50:01 2015-09-17 11:37:18 2018-02-10 15:50:01 2019-04-06 16:04:43
latitude 41.815117 41.895081 NaN 41.937405 41.881905 41.744377 41.914635 41.85199 41.882812 NaN
longitude -87.669998 -87.765404 NaN -87.716652 -87.755119 -87.658432 -87.681633 -87.689217 -87.704323 NaN
location (41.815117282, -87.669999562) (41.895080471, -87.765400451) None (41.937405765, -87.716649687) (41.881903443, -87.755121152) (41.744378879, -87.658430635) (41.914635603, -87.681630909) (41.851988885, -87.689219118) (41.88281374, -87.704325717) None
date_ts 2015-09-05 13:30:00 2015-09-04 11:30:00 2018-09-01 00:01:00 2015-09-05 12:45:00 2015-09-05 13:00:00 2015-09-05 10:55:00 2015-09-04 18:00:00 2015-09-05 13:00:00 2015-09-05 11:30:00 2016-05-01 00:25:00
hour 13 11 0 12 13 10 18 13 11 0
In [38]:
from pyspark.sql.functions import hour

crimes_df2 = crimes_df2.withColumn("hour", hour("date"))
In [39]:
crimes_df2.select("date", "hour").show(20, truncate=False)
+-------------------+----+
|date               |hour|
+-------------------+----+
|2015-09-05 13:30:00|13  |
|2015-09-04 11:30:00|11  |
|2018-09-01 00:01:00|0   |
|2015-09-05 12:45:00|12  |
|2015-09-05 13:00:00|13  |
|2015-09-05 10:55:00|10  |
|2015-09-04 18:00:00|18  |
|2015-09-05 13:00:00|13  |
|2015-09-05 11:30:00|11  |
|2016-05-01 00:25:00|0   |
|2015-09-05 14:00:00|14  |
|2015-09-05 11:00:00|11  |
|2015-09-05 03:00:00|3   |
|2015-09-05 12:50:00|12  |
|2015-09-03 13:00:00|13  |
|2015-09-05 11:45:00|11  |
|2015-09-05 13:30:00|13  |
|2015-07-08 00:00:00|0   |
|2015-09-05 09:55:00|9   |
|2015-09-05 12:35:00|12  |
+-------------------+----+
only showing top 20 rows

In [40]:
from pyspark.sql.functions import dayofweek

crimes_df2 = crimes_df2.withColumn("day_of_week", dayofweek("date_ts"))

crimes_semana_ano = crimes_df2.groupBy("year", "day_of_week") \
                             .count() \
                             .orderBy("year", "day_of_week")

pdf_semana = crimes_semana_ano.toPandas()

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

plt.figure(figsize=(16, 8))

ax = sns.lineplot(
    data=pdf_semana,
    x="day_of_week",
    y="count",
    hue="year",
    palette="tab20",
    marker="o"
)

# Mostrar dias 1–7
ax.set_xticks(sorted(pdf_semana["day_of_week"].unique()))

plt.xticks(rotation=0)
plt.title("Crimes por Dia da Semana (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Dia da Semana (1=Dom, 7=Sáb)", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)

plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()

plt.show()
[Imagem: gráfico de linhas "Crimes por Dia da Semana (Separado por Ano)"]
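Vale notar a convenção do `dayofweek` do Spark usada no eixo X (1=domingo ... 7=sábado), diferente do `weekday()` do Python (0=segunda ... 6=domingo). Um esboço da conversão entre as duas convenções:

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    """Replica a convenção do dayofweek do Spark: 1=domingo ... 7=sábado."""
    # date.weekday(): 0=segunda ... 6=domingo
    return (d.weekday() + 1) % 7 + 1

print(spark_dayofweek(date(2015, 9, 5)))  # sábado  -> 7
print(spark_dayofweek(date(2015, 9, 6)))  # domingo -> 1
```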
In [42]:
from pyspark.sql.functions import col

prisao_abs = crimes_df2.filter(col("arrest") == True) \
                      .groupBy("primary_type") \
                      .count() \
                      .orderBy("count", ascending=False)

from pyspark.sql.functions import col, sum, count

prisao_prop = crimes_df2.groupBy("primary_type") \
    .agg(
        count("*").alias("total_casos"),
        sum(col("arrest").cast("int")).alias("total_prisoes")
    ) \
    .withColumn("proporcao_prisao", col("total_prisoes") / col("total_casos")) \
    .orderBy(col("proporcao_prisao").desc())

pdf_prisao = prisao_prop.toPandas()
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

plt.figure(figsize=(14, 8))

# ordenar por proporção
df15 = pdf_prisao.head(15).sort_values("proporcao_prisao")

sns.barplot(
    data=df15,
    x="proporcao_prisao",
    y="primary_type",
    palette="rocket",
    hue="primary_type",
    dodge=False,
    legend=False
)

plt.title("Proporção de Prisões por Tipo de Crime", fontsize=18, weight="bold")
plt.xlabel("Proporção de Prisões")
plt.ylabel("Tipo de Crime")
plt.tight_layout()
plt.show()
[Imagem: gráfico de barras "Proporção de Prisões por Tipo de Crime"]
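A agregação usada para a proporção de prisões — `count("*")` e `sum(arrest` convertido para inteiro`)` por tipo — equivale, em miniatura, ao laço abaixo (dados fictícios, apenas ilustrativos):

```python
# amostra fictícia: (tipo, houve_prisao)
ocorrencias = [
    ("NARCOTICS", True), ("NARCOTICS", True), ("NARCOTICS", False),
    ("THEFT", False), ("THEFT", True), ("THEFT", False), ("THEFT", False),
]

# equivalente a count("*") e sum(arrest cast int) por tipo
totais, prisoes = {}, {}
for tipo, arrest in ocorrencias:
    totais[tipo] = totais.get(tipo, 0) + 1
    prisoes[tipo] = prisoes.get(tipo, 0) + int(arrest)

proporcao = {t: prisoes[t] / totais[t] for t in totais}
print(proporcao)  # NARCOTICS ≈ 0.67, THEFT = 0.25
```

O cast `arrest.cast("int")` transforma True/False em 1/0, de modo que a soma conta exatamente as prisões.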
In [43]:
from pyspark.sql.functions import create_map, lit, col
 
mapping = {
    1: "Rogers Park", 2: "West Ridge", 3: "Uptown", 4: "Lincoln Square",
    5: "North Center", 6: "Lake View", 7: "Lincoln Park", 8: "Near North Side",
    9: "Edison Park", 10: "Norwood Park", 11: "Jefferson Park", 12: "Forest Glen",
    13: "North Park", 14: "Albany Park", 15: "Portage Park", 16: "Irving Park",
    17: "Dunning", 18: "Montclair", 19: "Belmont Cragin", 20: "Hermosa",
    21: "Avondale", 22: "Logan Square", 23: "Humboldt Park", 24: "West Town",
    25: "Austin", 26: "West Garfield Park", 27: "East Garfield Park",
    28: "Near West Side", 29: "North Lawndale", 30: "South Lawndale",
    31: "Lower West Side", 32: "Loop", 33: "Near South Side", 34: "Armour Square",
    35: "Douglas", 36: "Oakland", 37: "Fuller Park", 38: "Grand Boulevard",
    39: "Kenwood", 40: "Washington Park", 41: "Hyde Park", 42: "Woodlawn",
    43: "South Shore", 44: "Chatham", 45: "Avalon Park", 46: "South Chicago",
    47: "Burnside", 48: "Calumet Heights", 49: "Roseland", 50: "Pullman",
    51: "South Deering", 52: "East Side", 53: "West Pullman", 54: "Riverdale",
    55: "Hegewisch", 56: "Garfield Ridge", 57: "Archer Heights",
    58: "Brighton Park", 59: "McKinley Park", 60: "Bridgeport", 61: "New City",
    62: "West Elsdon", 63: "Gage Park", 64: "Clearing", 65: "West Lawn",
    66: "Chicago Lawn", 67: "West Englewood", 68: "Englewood",
    69: "Greater Grand Crossing", 70: "Ashburn", 71: "Auburn Gresham",
    72: "Beverly", 73: "Washington Heights", 74: "Mount Greenwood",
    75: "Morgan Park", 76: "O'Hare", 77: "Edgewater"
}
 
# converter dict → lista alternada para create_map
mapping_expr = create_map([lit(x) for pair in mapping.items() for x in pair])
 
crimes_df2 = crimes_df.withColumn("community_area_name", mapping_expr[col("community_area")])
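O truque do dict → lista alternada usado acima para montar o `create_map` (que espera pares chave, valor intercalados) pode ser testado isoladamente em Python puro, com um dicionário reduzido só para ilustrar:

```python
# mesmo achatamento dict -> lista alternada usado no create_map
mapping = {1: "Rogers Park", 2: "West Ridge", 3: "Uptown"}

flat = [x for pair in mapping.items() for x in pair]
print(flat)  # [1, 'Rogers Park', 2, 'West Ridge', 3, 'Uptown']
```

No Spark, cada elemento dessa lista ainda é embrulhado em `lit(...)` antes de ir para `create_map`.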
In [44]:
areas_df = spark.read.csv("Boundaries_-_Community_Areas_20251201.csv", header=True, inferSchema=True)
print("Colunas do CSV:")
print(areas_df.columns)


areas_df.toPandas()
Colunas do CSV:
['the_geom', 'AREA_NUMBE', 'COMMUNITY', 'AREA_NUM_1', 'SHAPE_AREA', 'SHAPE_LEN']
Out[44]:
the_geom AREA_NUMBE COMMUNITY AREA_NUM_1 SHAPE_AREA SHAPE_LEN
0 MULTIPOLYGON (((-87.65455590025104 41.99816614... 1 ROGERS PARK 1 51,259,902.4506 34,052.3975757
1 MULTIPOLYGON (((-87.6846530946559 42.019484772... 2 WEST RIDGE 2 98,429,094.8621 43,020.6894583
2 MULTIPOLYGON (((-87.64102430213292 41.95480280... 3 UPTOWN 3 65,095,642.7289 46,972.7945549
3 MULTIPOLYGON (((-87.6744075678037 41.976103404... 4 LINCOLN SQUARE 4 71,352,328.2399 36,624.6030848
4 MULTIPOLYGON (((-87.67336415409336 41.93234274... 5 NORTH CENTER 5 57,054,167.85 31,391.6697542
... ... ... ... ... ... ...
72 MULTIPOLYGON (((-87.63373383514987 41.72885272... 73 WASHINGTON HEIGHTS 73 79,635,752.8769 42,222.598163
73 MULTIPOLYGON (((-87.69645961375822 41.70714491... 74 MOUNT GREENWOOD 74 75,584,290.0209 48,665.1305392
74 MULTIPOLYGON (((-87.64215204651398 41.68508211... 75 MORGAN PARK 75 91,877,340.6988 46,396.419362
75 MULTIPOLYGON (((-87.83658087874365 41.98639611... 76 OHARE 76 371,835,607.687 173,625.98466
76 MULTIPOLYGON (((-87.65455590025104 41.99816614... 77 EDGEWATER 77 48,449,990.8397 31,004.8309456

77 rows × 6 columns

In [45]:
areas_clean = areas_df.select(
    col("AREA_NUM_1").cast("int").alias("community_area"),
    col("COMMUNITY").alias("community_area_name"),
    col("the_geom"),
    col("SHAPE_AREA").alias("shape_area"),
    col("SHAPE_LEN").alias("shape_len"))
In [46]:
c = crimes_df2.alias("c")
a = areas_clean.alias("a")

crimes_joined = c.join(a, on="community_area", how="left")

taxa_prisao_area = crimes_joined.groupBy(col("c.community_area"), col("a.community_area_name"),col("the_geom"),) \
    .agg(
        count("*").alias("total_casos"),
        sum(col("c.arrest").cast("int")).alias("total_prisoes")
    ) \
    .withColumn("taxa_prisao", col("total_prisoes") / col("total_casos")) \
    .orderBy(col("taxa_prisao").desc())

taxa_prisao_area.toPandas()
Out[46]:
community_area community_area_name the_geom total_casos total_prisoes taxa_prisao
0 26.0 WEST GARFIELD PARK MULTIPOLYGON (((-87.72023936013656 41.86986908... 135522 58957 0.435036
1 27.0 EAST GARFIELD PARK MULTIPOLYGON (((-87.69157000948773 41.88819563... 134276 51471 0.383322
2 25.0 AUSTIN MULTIPOLYGON (((-87.78941511405804 41.91751009... 448276 168183 0.375177
3 23.0 HUMBOLDT PARK MULTIPOLYGON (((-87.69157000948773 41.88819563... 223982 84005 0.375052
4 29.0 NORTH LAWNDALE MULTIPOLYGON (((-87.72023936013656 41.86986908... 209901 77906 0.371156
... ... ... ... ... ... ...
74 7.0 LINCOLN PARK MULTIPOLYGON (((-87.63181810269614 41.93258180... 111392 14793 0.132801
75 72.0 BEVERLY MULTIPOLYGON (((-87.67308255594219 41.73565672... 26039 3349 0.128615
76 9.0 EDISON PARK MULTIPOLYGON (((-87.80675853375328 42.00083736... 7128 806 0.113075
77 0.0 None None 76 8 0.105263
78 12.0 FOREST GLEN MULTIPOLYGON (((-87.76918527760162 42.00488913... 13342 1392 0.104332

79 rows × 6 columns

In [47]:
from shapely import wkt
import geopandas as gpd
import matplotlib.pyplot as plt

taxa_prisao_pd = taxa_prisao_area.toPandas()

taxa_prisao_pd = taxa_prisao_pd.dropna(subset=["the_geom"])

taxa_prisao_pd["geometry"] = taxa_prisao_pd["the_geom"].apply(wkt.loads)

gdf = gpd.GeoDataFrame(
    taxa_prisao_pd,
    geometry="geometry",
    crs="EPSG:4326"   
)

gdf = gdf.to_crs(epsg=3857)

fig, ax = plt.subplots(figsize=(8, 8))

gdf.plot(
    column="taxa_prisao",
    ax=ax,
    edgecolor="black",
    linewidth=0.6,
    legend=True
)

ax.set_title("Taxa de prisão por Community Area em Chicago")
ax.set_axis_off()

plt.savefig("chicago_taxa_prisao_map.png", dpi=150, bbox_inches="tight")

plt.show()
[Imagem: mapa coroplético "Taxa de prisão por Community Area em Chicago"]
In [48]:
from pyspark.sql import functions as F

crimes_prisao = crimes_df2.groupBy("arrest") \
    .count() \
    .withColumn("percent", F.col("count") / crimes_df2.count() * 100) \
    .withColumn("percent", F.concat(F.format_number("percent", 2), F.lit("%")))

crimes_prisao.show()

# a variável alvo está desbalanceada
+------+-------+-------+
|arrest|  count|percent|
+------+-------+-------+
|  true|2034764| 26.14%|
| false|5749900| 73.86%|
+------+-------+-------+
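O percentual mostrado é a contagem de cada classe dividida pelo total, formatada com duas casas decimais — dá para conferir os valores do output com Python puro, usando as contagens da tabela acima:

```python
# contagens observadas no notebook
n_true, n_false = 2034764, 5749900
total = n_true + n_false

pct_true = f"{n_true / total * 100:.2f}%"
pct_false = f"{n_false / total * 100:.2f}%"
print(pct_true, pct_false)  # 26.14% 73.86%
```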

In [50]:
crimes_df2 = crimes_df2.withColumn("hour", hour(col("date")))
In [51]:
from pyspark.sql.functions import col, when
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
In [52]:
df_model = crimes_df2.select(
    col("arrest").cast("int").alias("label"),
    "arrest",
    col("domestic").cast("string").alias("domestic"),
    "primary_type",
    "location_description",
    col("hour").cast("int").alias("hour"),
    col("year").cast("int").alias("year"),
    col("community_area").cast("int").alias("community_area"),
    col("latitude").cast("double").alias("latitude"),
    col("longitude").cast("double").alias("longitude"),
    col("beat").cast("int").alias("beat"),
    col("district").cast("int").alias("district"),
    col("ward").cast("int").alias("ward")
)
In [53]:
# 0) criar coluna 'hour' a partir da coluna de data
from pyspark.sql.functions import hour, col, when

crimes_df2 = crimes_df2.withColumn("hour", hour(col("date")))

# ==========================================
# 1. SELECIONAR COLUNAS E AJUSTAR TIPOS
# ==========================================

df_model = crimes_df2.select(
    # alvo: arrest -> vira label numérico 0/1
    col("arrest").cast("int").alias("label"),
    "arrest",

    # domestic como string para o StringIndexer
    col("domestic").cast("string").alias("domestic"),

    # categóricas
    "primary_type",
    "location_description",

    # numéricas (tipos garantidos)
    col("hour").cast("int").alias("hour"),
    col("year").cast("int").alias("year"),
    col("community_area").cast("int").alias("community_area"),
    col("latitude").cast("double").alias("latitude"),
    col("longitude").cast("double").alias("longitude"),
    col("beat").cast("int").alias("beat"),
    col("district").cast("int").alias("district"),
    col("ward").cast("int").alias("ward")
)

df_model = df_model.dropna(subset=[
    "label", "domestic", "primary_type", "location_description",
    "hour", "year", "community_area", "latitude", "longitude",
    "beat", "district", "ward"
])

print("Linhas no df_model (após dropna):", df_model.count())
df_model.printSchema()
Linhas no df_model (após dropna): 7084438
root
 |-- label: integer (nullable = true)
 |-- arrest: boolean (nullable = true)
 |-- domestic: string (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- location_description: string (nullable = true)
 |-- hour: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- community_area: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- beat: integer (nullable = true)
 |-- district: integer (nullable = true)
 |-- ward: integer (nullable = true)

In [54]:
from pyspark.sql.functions import col

# ==========================================
# 2. CRIAR AMOSTRA MENOR (df_small)
# ==========================================

# pega ~5% da base para não estourar memória na modelagem
df_small = df_model.sample(withReplacement=False, fraction=0.05, seed=42)

print("Linhas no df_model completo :", df_model.count())
print("Linhas na amostra (df_small):", df_small.count())

# ver o desbalanceamento da variável alvo na amostra
df_small.groupBy("label").count().show()
Linhas no df_model completo : 7084438
Linhas na amostra (df_small): 353791
+-----+------+
|label| count|
+-----+------+
|    1| 91896|
|    0|261895|
+-----+------+
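O `sample(fraction=0.05)` do Spark faz um sorteio Bernoulli linha a linha: cada linha entra na amostra com probabilidade 0,05. Por isso a amostra tem ~354 mil linhas, e não exatamente 5% de 7.084.438. Um esboço da mesma ideia em Python puro:

```python
import random

# sorteio Bernoulli linha a linha, como no sample(fraction=0.05) do Spark
random.seed(42)
n = 100_000
amostra = sum(1 for _ in range(n) if random.random() < 0.05)

# o tamanho observado fica próximo de 5%, mas não é exato
print(amostra / n)
```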

In [55]:
from pyspark.sql.functions import when

# ==========================================
# 3. BALANCEAR A CLASSE COM WEIGHT + SPLIT
# ==========================================

# contar classe majoritária e minoritária na amostra
major = df_small.filter(col("label") == 0).count()
minor = df_small.filter(col("label") == 1).count()

print("Classe 0 (sem prisão):", major)
print("Classe 1 (com prisão):", minor)

ratio = major / minor
print("Peso aplicado à classe 1 (prisão):", ratio)

# criar coluna de peso
df_small = df_small.withColumn(
    "weight",
    when(col("label") == 1, ratio).otherwise(1.0)
)

# split treino / teste
train, test = df_small.randomSplit([0.8, 0.2], seed=42)

print("Linhas treino:", train.count())
print("Linhas teste :", test.count())
Classe 0 (sem prisão): 261895
Classe 1 (com prisão): 91896
Peso aplicado à classe 1 (prisão): 2.8499064159484635
Linhas treino: 282923
Linhas teste : 70868
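O peso aplicado à classe 1 é escolhido para que a soma dos pesos das duas classes se iguale no treino — um esboço rápido com as contagens observadas acima:

```python
# contagens observadas na amostra (valores do notebook)
major, minor = 261895, 91896

ratio = major / minor  # peso dado à classe 1
print(round(ratio, 4))  # 2.8499

# com esse peso, a massa total da classe 1 iguala a da classe 0
print(minor * ratio == major)  # True (a menos de erro de ponto flutuante)
```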
In [53]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# ==========================================
# 4. PRÉ-PROCESSAMENTO + MODELO (PIPELINE)
# ==========================================

# colunas categóricas que vamos codificar
categorical_cols = [
    "primary_type",
    "location_description",
    "domestic",
    "community_area"
]

# StringIndexer: texto -> índice numérico
indexers = [
    StringIndexer(
        inputCol=c,
        outputCol=f"{c}_idx",
        handleInvalid="keep"
    )
    for c in categorical_cols
]

# OneHotEncoder: índice -> vetor binário
encoders = [
    OneHotEncoder(
        inputCols=[f"{c}_idx"],
        outputCols=[f"{c}_ohe"]
    )
    for c in categorical_cols
]

# colunas numéricas + OHE que vão virar o vetor de features
feature_cols = [
    "primary_type_ohe",
    "location_description_ohe",
    "domestic_ohe",
    "community_area_ohe",
    "hour",
    "year",
    "latitude",
    "longitude",
    "beat",
    "district",
    "ward"
]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)
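Para deixar concreto o que `StringIndexer` e `OneHotEncoder` fazem com uma coluna categórica, segue um esboço em Python puro (dados fictícios): o indexador atribui índices por ordem de frequência (0 = categoria mais comum) e o one-hot, com o `dropLast` padrão do Spark, gera vetores com uma posição a menos que o número de categorias — a última categoria vira o vetor todo-zeros.

```python
from collections import Counter

valores = ["THEFT", "BATTERY", "THEFT", "NARCOTICS", "THEFT", "BATTERY"]

# StringIndexer: índices ordenados por frequência (0 = mais frequente)
ordem = [cat for cat, _ in Counter(valores).most_common()]
indice = {cat: i for i, cat in enumerate(ordem)}
print(indice)  # {'THEFT': 0, 'BATTERY': 1, 'NARCOTICS': 2}

# OneHotEncoder (dropLast padrão): vetor com len(categorias) - 1 posições
def one_hot(cat):
    vec = [0] * (len(indice) - 1)
    if indice[cat] < len(vec):
        vec[indice[cat]] = 1
    return vec

print(one_hot("THEFT"), one_hot("NARCOTICS"))  # [1, 0] [0, 0]
```

No pipeline real, esses vetores esparsos são concatenados com as colunas numéricas pelo `VectorAssembler`.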
In [ ]:
# modelo de regressão logística usando o peso
lr = LogisticRegression(
    labelCol="label",
    featuresCol="features",
    weightCol="weight",
    maxIter=20
)

# pipeline completa: indexers -> encoders -> assembler -> modelo
pipeline_lr = Pipeline(stages=indexers + encoders + [assembler, lr])

# treinar o modelo
lr_model = pipeline_lr.fit(train)

# gerar previsões no conjunto de teste
pred_lr = lr_model.transform(test)

pred_lr.select("label", "probability", "prediction").show(10, truncate=False)
+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+----------------------------------------+----------+
|0    |[0.7306128724815133,0.2693871275184867] |0.0       |
|0    |[0.8482676816866546,0.15173231831334544]|0.0       |
|0    |[0.854632023435212,0.14536797656478795] |0.0       |
|0    |[0.7668307622814486,0.23316923771855136]|0.0       |
|0    |[0.7543679356585427,0.24563206434145735]|0.0       |
|0    |[0.4845632496315539,0.515436750368446]  |1.0       |
|0    |[0.5956006353741462,0.4043993646258538] |0.0       |
|0    |[0.5459528236746588,0.4540471763253412] |0.0       |
|0    |[0.6910459715206198,0.30895402847938025]|0.0       |
|0    |[0.683761820797112,0.316238179202888]   |0.0       |
+-----+----------------------------------------+----------+
only showing top 10 rows
In [53]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.sql.functions import countDistinct

# ==========================================
# 5. AVALIAÇÃO DO MODELO (AUC, F1, ACURÁCIA)
# ==========================================

# ver se temos as duas classes no teste
pred_lr.groupBy("label").count().show()

num_classes_test = pred_lr.select(countDistinct("label")).collect()[0][0]
print("Número de classes no conjunto de teste:", num_classes_test)

# --- AUC (curva ROC) ---
# usando a coluna 'probability'
evaluator_auc = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",   # usa a probabilidade da classe 1
    metricName="areaUnderROC"
)

if num_classes_test > 1:
    auc_lr = evaluator_auc.evaluate(pred_lr)
    print("AUC (Logistic Regression):", auc_lr)
else:
    print("AUC não pôde ser calculado: só há uma classe no conjunto de teste.")

# --- F1-score ---
evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)
f1_lr = evaluator_f1.evaluate(pred_lr)
print("F1 (Logistic Regression):", f1_lr)

# --- Acurácia ---
evaluator_acc = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)
acc_lr = evaluator_acc.evaluate(pred_lr)
print("Accuracy (Logistic Regression):", acc_lr)
+-----+-----+
|label|count|
+-----+-----+
|    1|18430|
|    0|52438|
+-----+-----+

Número de classes no conjunto de teste: 2
AUC (Logistic Regression): 0.8762476988301116
F1 (Logistic Regression): 0.8358207838802699
Accuracy (Logistic Regression): 0.835906191793193
In [54]:
from pyspark.sql.functions import col

# MATRIZ DE CONFUSÃO (tabela simples)
confusion = (
    pred_lr
      .groupBy("label", "prediction")
      .count()
      .orderBy("label", "prediction")
)

confusion.show()
+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    0|       0.0|46653|
|    0|       1.0| 5785|
|    1|       0.0| 5844|
|    1|       1.0|12586|
+-----+----------+-----+
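A acurácia reportada anteriormente pode ser conferida diretamente a partir da matriz de confusão acima:

```python
# valores da matriz de confusão do notebook
tn, fp, fn, tp = 46653, 5785, 5844, 12586

acuracia = (tp + tn) / (tp + tn + fp + fn)
print(round(acuracia, 4))  # 0.8359
```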

In [55]:
from pyspark.sql.functions import sum as Fsum, when

# Criar colunas booleanas para cada combinação
metrics_df = pred_lr.select(
    when( (col("label") == 1) & (col("prediction") == 1), 1).otherwise(0).alias("TP"),
    when( (col("label") == 0) & (col("prediction") == 1), 1).otherwise(0).alias("FP"),
    when( (col("label") == 1) & (col("prediction") == 0), 1).otherwise(0).alias("FN"),
    when( (col("label") == 0) & (col("prediction") == 0), 1).otherwise(0).alias("TN")
)

agg = metrics_df.agg(
    Fsum("TP").alias("TP"),
    Fsum("FP").alias("FP"),
    Fsum("FN").alias("FN"),
    Fsum("TN").alias("TN")
).collect()[0]

TP = agg["TP"]
FP = agg["FP"]
FN = agg["FN"]
TN = agg["TN"]

print("TP:", TP, "FP:", FP, "FN:", FN, "TN:", TN)

# precisão, recall, F1 para classe 1 (prisão)
precision_1 = TP / (TP + FP) if (TP + FP) > 0 else 0
recall_1    = TP / (TP + FN) if (TP + FN) > 0 else 0
f1_1        = 2 * precision_1 * recall_1 / (precision_1 + recall_1) if (precision_1 + recall_1) > 0 else 0

print(f"Classe 1 (prisão) - Precision: {precision_1:.3f}, Recall: {recall_1:.3f}, F1: {f1_1:.3f}")
TP: 12586 FP: 5785 FN: 5844 TN: 46653
Classe 1 (prisão) - Precision: 0.685, Recall: 0.683, F1: 0.684
In [56]:
# Class 0 (no arrest)
precision_0 = TN / (TN + FN) if (TN + FN) > 0 else 0
recall_0    = TN / (TN + FP) if (TN + FP) > 0 else 0
f1_0        = 2 * precision_0 * recall_0 / (precision_0 + recall_0) if (precision_0 + recall_0) > 0 else 0

print(f"Class 0 (no arrest) - Precision: {precision_0:.3f}, Recall: {recall_0:.3f}, F1: {f1_0:.3f}")
Class 0 (no arrest) - Precision: 0.889, Recall: 0.890, F1: 0.889
In [57]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)

weighted_f1 = evaluator_f1.evaluate(pred_lr)
print("Weighted F1 (support-weighted average across classes):", weighted_f1)
Weighted F1 (support-weighted average across classes): 0.8358207838802699
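A quick plain-Python check of what `metricName="f1"` actually returns: Spark's multiclass `f1` is the support-weighted average of the per-class F1 scores, not the unweighted macro average. Using the confusion counts printed earlier:

```python
# Confusion-matrix counts taken from the logistic regression output above
TP, FP, FN, TN = 12586, 5785, 5844, 46653

def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

f1_pos = f1(TP, FP, FN)              # class 1 (arrest)
f1_neg = f1(TN, FN, FP)              # class 0 (roles of the counts swapped)
n_pos, n_neg = TP + FN, TN + FP      # class supports

weighted_f1 = (f1_pos * n_pos + f1_neg * n_neg) / (n_pos + n_neg)
macro_f1 = (f1_pos + f1_neg) / 2
print(f"weighted F1: {weighted_f1:.4f}")  # matches the evaluator (~0.8358)
print(f"macro F1:    {macro_f1:.4f}")     # ~0.7866, lower: class 1 is the minority
```

The weighted value reproduces the evaluator's 0.8358; the true macro average is noticeably lower because the minority class (arrest) has the weaker F1.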
                                                                                
In [58]:
from pyspark.ml.classification import LogisticRegressionModel

# grab the last pipeline stage (should be the fitted model)
last_stage = lr_model.stages[-1]
print("Last stage type:", type(last_stage))

# make sure it really is a LogisticRegressionModel
if isinstance(last_stage, LogisticRegressionModel):
    lr_stage = last_stage
    training_summary = lr_stage.summary

    # DataFrame with the ROC curve points
    # (note: summary metrics are computed on the TRAINING data, not the test set)
    roc_df = training_summary.roc
    roc_df.show(5)

    print("AUC (via training summary):", training_summary.areaUnderROC)
else:
    print("Last stage is not a LogisticRegressionModel; it is:", type(last_stage))
Last stage type: <class 'pyspark.ml.classification.LogisticRegressionModel'>
                                                                                
+--------------------+--------------------+
|                 FPR|                 TPR|
+--------------------+--------------------+
|                 0.0|                 0.0|
|9.548499214635939E-6|0.003865733808836...|
|2.864549764390782E-5|0.008139819780578763|
|2.864549764390782E-5|0.012060000544469537|
|3.341974725122579E-5|0.016238804344866995|
+--------------------+--------------------+
only showing top 5 rows

AUC (via training summary): 0.8770441487678992
In [59]:
roc_pd = roc_df.toPandas()

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.plot(roc_pd["FPR"], roc_pd["TPR"])
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression")
plt.grid(True)
plt.tight_layout()
plt.show()
[Image: ROC curve for the logistic regression model]
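For reference, `areaUnderROC` is the trapezoidal area under this (FPR, TPR) curve. A minimal sketch with hypothetical ROC points (for the real curve you would pass `roc_pd["FPR"]` / `roc_pd["TPR"]`):

```python
# Hypothetical ROC points (FPR, TPR), sorted by increasing FPR
fpr = [0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]

# Trapezoidal rule: sum the trapezoid areas between consecutive points
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i] + tpr[i + 1]) / 2
          for i in range(len(fpr) - 1))
print(f"AUC: {auc:.3f}")  # 0.800 for these toy points
```

Applied to the full `roc_df`, this integration reproduces the `areaUnderROC` value reported by the training summary.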
In [54]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

# Random Forest model
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    weightCol="weight",   # same class weights used for the logistic regression
    numTrees=50,          # kept modest; reduce further if memory is tight
    maxDepth=10,          # limits tree complexity
    featureSubsetStrategy="auto",
    seed=42
)

# Pipeline: indexers -> encoders -> assembler -> Random Forest
pipeline_rf = Pipeline(stages=indexers + encoders + [assembler, rf])
In [55]:
# Fit the Random Forest
rf_model = pipeline_rf.fit(train)

# Predictions on the test set
pred_rf = rf_model.transform(test)
pred_rf.select("label", "probability", "prediction").show(10, truncate=False)
+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+----------------------------------------+----------+
|0    |[0.5082534895697707,0.4917465104302293] |0.0       |
|0    |[0.6237785832628645,0.37622141673713544]|0.0       |
|0    |[0.6170512561154904,0.38294874388450956]|0.0       |
|0    |[0.4386105172615851,0.5613894827384149] |1.0       |
|0    |[0.5267942750429343,0.4732057249570656] |0.0       |
|0    |[0.5039587905528343,0.4960412094471657] |0.0       |
|0    |[0.5642040166295567,0.4357959833704433] |0.0       |
|0    |[0.5539469405294192,0.44605305947058077]|0.0       |
|0    |[0.563183374732184,0.43681662526781595] |0.0       |
|0    |[0.5579666344659647,0.4420333655340352] |0.0       |
+-----+----------------------------------------+----------+
only showing top 10 rows
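On how these columns relate: `probability` holds `[P(label=0), P(label=1)]`, and `prediction` is its argmax, which in the binary case is equivalent to a 0.5 threshold on P(label=1). Taking the fourth row above as an example:

```python
# Probability vector from the 4th prediction row above: [P(0), P(1)]
probs = [0.4386105172615851, 0.5613894827384149]

# prediction = index of the largest class probability (argmax)
pred = max(range(len(probs)), key=lambda i: probs[i])
print(pred)  # 1, matching prediction=1.0 for that row
```

That row is the one false positive visible in the sample: P(1) barely exceeds 0.5, so the model predicts an arrest for a no-arrest incident.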

                                                                                
In [56]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

evaluator_auc = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)

evaluator_acc = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

auc_rf = evaluator_auc.evaluate(pred_rf)
f1_rf = evaluator_f1.evaluate(pred_rf)
acc_rf = evaluator_acc.evaluate(pred_rf)

print("=== RANDOM FOREST ===")
print(f"AUC:      {auc_rf:.4f}")
print(f"F1-score: {f1_rf:.4f}")
print(f"Accuracy: {acc_rf:.4f}")
=== RANDOM FOREST ===
AUC:      0.8657
F1-score: 0.8389
Accuracy: 0.8425
                                                                                
In [56]:
# NOTE: with this grid the run below exhausted the JVM heap in this environment
# and aborted with OutOfMemoryError (see the log at the end of the cell).
# Shrink the grid, sample the data, or lower parallelism before re-running.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.sql.functions import col

# ==========================================
# 1. Base Random Forest model
#    (reusing indexers, encoders, assembler)
# ==========================================

rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    weightCol="weight",   # drop this parameter if it proves too heavy
    seed=42
)

pipeline_rf = Pipeline(stages=indexers + encoders + [assembler, rf])

# ==========================================
# 2. Hyperparameter grid (kept small!)
# ==========================================

paramGrid_rf = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 150])      # number of trees
    .addGrid(rf.maxDepth, [8, 12])        # maximum depth
    .addGrid(rf.featureSubsetStrategy, ["sqrt", "log2"])  # features per split
    .build()
)

# ==========================================
# 3. Evaluator (optimizing for AUC)
# ==========================================

evaluator_auc = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

# ==========================================
# 4. TrainValidationSplit (lighter than cross-validation)
# ==========================================

tvs_rf = TrainValidationSplit(
    estimator=pipeline_rf,
    estimatorParamMaps=paramGrid_rf,
    evaluator=evaluator_auc,
    trainRatio=0.8,      # 80% train / 20% validation within the training set
    parallelism=2        # if the environment can handle it
)

# ==========================================
# 5. Fit with tuning on the sample (df_small)
#    (make sure train/test came from df_small)
# ==========================================

# If the split from df_small has not been done yet:
# train, test = df_small.randomSplit([0.8, 0.2], seed=42)

print("Training Random Forest with tuning...")
tvs_model_rf = tvs_rf.fit(train)

# Best model found in the grid
best_model_rf = tvs_model_rf.bestModel
best_rf_stage = best_model_rf.stages[-1]   # last stage is the fitted random forest model

print("Best hyperparameters found:")
print("  numTrees:", best_rf_stage.getOrDefault("numTrees"))
print("  maxDepth:", best_rf_stage.getOrDefault("maxDepth"))
print("  featureSubsetStrategy:", best_rf_stage.getOrDefault("featureSubsetStrategy"))

# ==========================================
# 6. Evaluate the best model on the test set
# ==========================================

pred_rf_best = best_model_rf.transform(test)

# AUC
auc_rf = evaluator_auc.evaluate(pred_rf_best)
print("\n=== RANDOM FOREST (TUNED) ===")
print(f"AUC:      {auc_rf:.4f}")

# F1 and accuracy
evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)
evaluator_acc = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

f1_rf = evaluator_f1.evaluate(pred_rf_best)
acc_rf = evaluator_acc.evaluate(pred_rf_best)

print(f"F1-score: {f1_rf:.4f}")
print(f"Accuracy: {acc_rf:.4f}")

# (optional) distribution of correct/incorrect predictions
pred_rf_best.groupBy("label", "prediction").count().show()
Treinando Random Forest com tuning...
25/12/06 20:14:11 WARN DAGScheduler: Broadcasting large task binary with size 1206.8 KiB
25/12/06 20:14:50 WARN DAGScheduler: Broadcasting large task binary with size 1206.8 KiB
25/12/06 20:14:52 WARN DAGScheduler: Broadcasting large task binary with size 1085.7 KiB
25/12/06 20:14:53 WARN DAGScheduler: Broadcasting large task binary with size 1670.3 KiB
25/12/06 20:14:53 WARN DAGScheduler: Broadcasting large task binary with size 1375.6 KiB
25/12/06 20:14:56 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:14:56 WARN DAGScheduler: Broadcasting large task binary with size 1717.8 KiB
25/12/06 20:14:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:14:59 WARN DAGScheduler: Broadcasting large task binary with size 3.0 MiB
25/12/06 20:15:03 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
25/12/06 20:15:03 WARN DAGScheduler: Broadcasting large task binary with size 1143.3 KiB
25/12/06 20:15:06 WARN DAGScheduler: Broadcasting large task binary with size 2042.8 KiB
25/12/06 20:15:09 WARN MemoryStore: Not enough space to cache rdd_1305_3 in memory! (computed 12.0 MiB so far)
25/12/06 20:15:09 WARN MemoryStore: Not enough space to cache rdd_1305_26 in memory! (computed 12.0 MiB so far)
25/12/06 20:15:09 WARN BlockManager: Persisting block rdd_1305_26 to disk instead.
25/12/06 20:15:09 WARN BlockManager: Persisting block rdd_1305_3 to disk instead.
25/12/06 20:15:50 WARN DAGScheduler: Broadcasting large task binary with size 1430.3 KiB
25/12/06 20:15:50 WARN DAGScheduler: Broadcasting large task binary with size 1144.7 KiB
25/12/06 20:15:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:15:57 WARN DAGScheduler: Broadcasting large task binary with size 1626.0 KiB
25/12/06 20:15:59 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/12/06 20:16:00 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:16:06 WARN DAGScheduler: Broadcasting large task binary with size 1954.0 KiB
25/12/06 20:16:07 WARN DAGScheduler: Broadcasting large task binary with size 1457.9 KiB
25/12/06 20:16:11 WARN MemoryStore: Not enough space to cache rdd_1589_31 in memory! (computed 12.0 MiB so far)
25/12/06 20:16:11 WARN BlockManager: Persisting block rdd_1589_31 to disk instead.
25/12/06 20:16:53 WARN DAGScheduler: Broadcasting large task binary with size 1430.3 KiB
25/12/06 20:16:53 WARN DAGScheduler: Broadcasting large task binary with size 1144.7 KiB
25/12/06 20:16:54 WARN MemoryStore: Not enough space to cache rdd_1594_1 in memory! (computed 12.0 MiB so far)
25/12/06 20:16:54 WARN MemoryStore: Not enough space to cache rdd_1594_2 in memory! (computed 8.0 MiB so far)
25/12/06 20:16:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:16:59 WARN DAGScheduler: Broadcasting large task binary with size 1626.0 KiB
25/12/06 20:17:01 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/12/06 20:17:01 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:17:04 WARN MemoryStore: Not enough space to cache rdd_1594_1 in memory! (computed 12.0 MiB so far)
25/12/06 20:17:08 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB
25/12/06 20:17:17 ERROR Executor: Exception in task 1.0 in stage 659.0 (TID 11116)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 12.0 in stage 659.0 (TID 11127)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 27.0 in stage 659.0 (TID 11142)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 7.0 in stage 659.0 (TID 11122)
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.lang.Integer.valueOf(Integer.java:1059)
	at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
	at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:246)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:246)
	at org.apache.spark.ml.tree.impl.DTStatsAggregator.<init>(DTStatsAggregator.scala:54)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22(RandomForest.scala:651)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22$adapted(RandomForest.scala:647)
	at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6148/0x0000000841998040.apply(Unknown Source)
	at scala.Array$.tabulate(Array.scala:418)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$21(RandomForest.scala:647)
	at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6138/0x000000084195e840.apply(Unknown Source)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:858)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:858)
	at org.apache.spark.rdd.RDD$$Lambda$4717/0x0000000841702c40.apply(Unknown Source)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
	at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
25/12/06 20:17:17 ERROR Executor: Exception in task 26.0 in stage 659.0 (TID 11141)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 31.0 in stage 659.0 (TID 11146)
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.lang.reflect.Array.newInstance(Array.java:78)
	at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2098)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1675)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
	at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1735)
	at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:961)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1276)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1377)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
	at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 27.0 in stage 659.0 (TID 11142),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1.0 in stage 659.0 (TID 11116),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 7.0 in stage 659.0 (TID 11122),5,main]
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.lang.Integer.valueOf(Integer.java:1059)
	at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
	at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:246)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:246)
	at org.apache.spark.ml.tree.impl.DTStatsAggregator.<init>(DTStatsAggregator.scala:54)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22(RandomForest.scala:651)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22$adapted(RandomForest.scala:647)
	at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6148/0x0000000841998040.apply(Unknown Source)
	at scala.Array$.tabulate(Array.scala:418)
	at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$21(RandomForest.scala:647)
	at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6138/0x000000084195e840.apply(Unknown Source)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:858)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:858)
	at org.apache.spark.rdd.RDD$$Lambda$4717/0x0000000841702c40.apply(Unknown Source)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
	at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 31.0 in stage 659.0 (TID 11146),5,main]
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.lang.reflect.Array.newInstance(Array.java:78)
	at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2098)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1675)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
	at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1735)
	at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:961)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1276)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1377)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
	at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 26.0 in stage 659.0 (TID 11141),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 12.0 in stage 659.0 (TID 11127),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 WARN TaskSetManager: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space

25/12/06 20:17:17 ERROR TaskSetManager: Task 26 in stage 659.0 failed 1 times; aborting job
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5c068313 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@13d0dc8a rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@9dc0ec5 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Instrumentation: org.apache.spark.SparkException: Job 367 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1259)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1257)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1257)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:3129)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$stop$3(DAGScheduler.scala:3015)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:3015)
	at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2258)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2258)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2211)
	at org.apache.spark.SparkContext.$anonfun$new$34(SparkContext.scala:681)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:995)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:738)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:737)
	at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:663)
	at org.apache.spark.ml.tree.impl.RandomForest$.runBagged(RandomForest.scala:208)
	at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:302)
	at org.apache.spark.ml.classification.RandomForestClassifier.$anonfun$train$1(RandomForestClassifier.scala:168)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:47)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

25/12/06 20:17:17 ERROR Instrumentation: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2898)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2834)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2833)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2833)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1253)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1253)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1253)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3102)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3036)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3025)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:995)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:738)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:737)
	at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:663)
	at org.apache.spark.ml.tree.impl.RandomForest$.runBagged(RandomForest.scala:208)
	at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:302)
	at org.apache.spark.ml.classification.RandomForestClassifier.$anonfun$train$1(RandomForestClassifier.scala:168)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:47)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Java heap space
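The root failure in the log above is `java.lang.OutOfMemoryError: Java heap space` raised inside `RandomForest.findBestSplits` (the `collectAsMap` call in the stack trace), which aggregates per-node split statistics on the driver; in local mode the driver JVM is also running every task, so its heap is the budget for the whole job. A minimal mitigation sketch follows. It is illustrative only: the memory size and hyperparameter values are assumptions for this instance type, not tuned results, and `spark.driver.memory` only takes effect if set before the JVM is launched (e.g. in a fresh Python process or via `spark-submit`).

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = (
    SparkSession.builder
    .appName("chicago-crimes-rf")
    # Local mode: tasks run inside the driver JVM, so this single setting
    # bounds all memory. "8g" is a hypothetical value; size it to the machine.
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# Fewer/shallower trees and fewer candidate split bins shrink the split
# statistics that findBestSplits collects back to the driver, which is
# where the heap was exhausted in the trace above.
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    numTrees=50,          # smaller ensemble than the failing run (assumed)
    maxDepth=10,          # shallower trees -> fewer nodes to aggregate
    maxBins=32,           # fewer candidate splits per feature
    subsamplingRate=0.8,  # each tree trains on a sample of the data
)
```

Another option, when the instance cannot be resized, is to downsample the training DataFrame (e.g. `df.sample(fraction=0.5, seed=42)`) before calling `fit`, trading some accuracy for a job that completes.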

25/12/06 20:17:17 WARN TaskSetManager: Lost task 13.0 in stage 659.0 (TID 11128) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space

Driver stacktrace:)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@52ca79a8 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Shutting down, pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 11125]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@48817daf rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@52f49aa rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@28e890e1 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@6743e62 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@5411bc29 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@395bf5ca rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@2d313338 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@7896f707 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@4d11d55d rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@12fabfe rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@7a208397 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@ca0aba6 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@779437a9 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5417efe6 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@55f68507 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
[... the same RejectedExecutionException trace repeats for each remaining task as the executor thread pool shuts down; 9 near-identical blocks omitted ...]
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3699, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_240763/3648131322.py", line 64, in <module>
    tvs_model_rf = tvs_rf.fit(train)
                   ^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/base.py", line 205, in fit
    return self._fit(dataset)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/tuning.py", line 1464, in _fit
    for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
  File "/usr/lib/python3.12/multiprocessing/pool.py", line 873, in next
    raise value
  File "/usr/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/tuning.py", line 1464, in <lambda>
    for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
                                                             ^^^
  File "/opt/spark/python/pyspark/util.py", line 342, in wrapped
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/tuning.py", line 113, in singleTask
    index, model = next(modelIter)
                   ^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/base.py", line 98, in __next__
    return index, self.fitSingleModel(index)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/base.py", line 156, in fitSingleModel
    return estimator.fit(dataset, paramMaps[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/base.py", line 203, in fit
    return self.copy(params)._fit(dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/pipeline.py", line 134, in _fit
    model = stage.fit(dataset)
            ^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/base.py", line 205, in fit
    return self._fit(dataset)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 381, in _fit
    java_model = self._fit_java(dataset)
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 378, in _fit_java
    return self._java_obj.fit(dataset._jdf)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/pyspark/errors/exceptions/captured.py", line 179, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
    [... skipping hidden 1 frame]

Cell In[57], line 64
     63 print("Treinando Random Forest com tuning...")
---> 64 tvs_model_rf = tvs_rf.fit(train)
     66 # Melhor modelo dentro do grid

File /opt/spark/python/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    204     else:
--> 205         return self._fit(dataset)
    206 else:

File /opt/spark/python/pyspark/ml/tuning.py:1464, in TrainValidationSplit._fit(self, dataset)
   1463 metrics = [None] * numModels
-> 1464 for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
   1465     metrics[j] = metric

File /usr/lib/python3.12/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout)
    872     return value
--> 873 raise value

File /usr/lib/python3.12/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:

File /opt/spark/python/pyspark/ml/tuning.py:1464, in TrainValidationSplit._fit.<locals>.<lambda>(f)
   1463 metrics = [None] * numModels
-> 1464 for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
   1465     metrics[j] = metric

File /opt/spark/python/pyspark/util.py:342, in inheritable_thread_target.<locals>.wrapped(*args, **kwargs)
    341 SparkContext._active_spark_context._jsc.sc().setLocalProperties(properties)
--> 342 return f(*args, **kwargs)

File /opt/spark/python/pyspark/ml/tuning.py:113, in _parallelFitTasks.<locals>.singleTask()
    112 def singleTask() -> Tuple[int, float, Transformer]:
--> 113     index, model = next(modelIter)
    114     # TODO: duplicate evaluator to take extra params from input
    115     #  Note: Supporting tuning params in evaluator need update method
    116     #  `MetaAlgorithmReadWrite.getAllNestedStages`, make it return
    117     #  all nested stages and evaluators

File /opt/spark/python/pyspark/ml/base.py:98, in _FitMultipleIterator.__next__(self)
     97     self.counter += 1
---> 98 return index, self.fitSingleModel(index)

File /opt/spark/python/pyspark/ml/base.py:156, in Estimator.fitMultiple.<locals>.fitSingleModel(index)
    155 def fitSingleModel(index: int) -> M:
--> 156     return estimator.fit(dataset, paramMaps[index])

File /opt/spark/python/pyspark/ml/base.py:203, in Estimator.fit(self, dataset, params)
    202 if params:
--> 203     return self.copy(params)._fit(dataset)
    204 else:

File /opt/spark/python/pyspark/ml/pipeline.py:134, in Pipeline._fit(self, dataset)
    133 else:  # must be an Estimator
--> 134     model = stage.fit(dataset)
    135     transformers.append(model)

File /opt/spark/python/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    204     else:
--> 205         return self._fit(dataset)
    206 else:

File /opt/spark/python/pyspark/ml/wrapper.py:381, in JavaEstimator._fit(self, dataset)
    380 def _fit(self, dataset: DataFrame) -> JM:
--> 381     java_model = self._fit_java(dataset)
    382     model = self._create_model(java_model)

File /opt/spark/python/pyspark/ml/wrapper.py:378, in JavaEstimator._fit_java(self, dataset)
    377 self._transfer_params_to_java()
--> 378 return self._java_obj.fit(dataset._jdf)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:

File /opt/spark/python/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
    178 try:
--> 179     return f(*a, **kw)
    180 except Py4JJavaError as e:

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:

<class 'str'>: (<class 'ConnectionRefusedError'>, ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

ConnectionRefusedError                    Traceback (most recent call last)
    [... skipping hidden 1 frame]

File /opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2205, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
   2202         traceback.print_exc()
   2203         return None
-> 2205     self._showtraceback(etype, value, stb)
   2206 if self.call_pdb:
   2207     # drop into debugger
   2208     self.debugger(force=True)

File /opt/jupyterhub-venv/lib/python3.12/site-packages/ipykernel/zmqshell.py:642, in ZMQInteractiveShell._showtraceback(self, etype, evalue, stb)
    636 sys.stdout.flush()
    637 sys.stderr.flush()
    639 exc_content = {
    640     "traceback": stb,
    641     "ename": str(etype.__name__),
--> 642     "evalue": str(evalue),
    643 }
    645 dh = self.displayhook
    646 # Send exception info over pub socket for other clients than the caller
    647 # to pick up

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:471, in Py4JJavaError.__str__(self)
    469 def __str__(self):
    470     gateway_client = self.java_exception._gateway_client
--> 471     answer = gateway_client.send_command(self.exception_cmd)
    472     return_value = get_return_value(answer, gateway_client, None, None)
    473     # Note: technically this should return a bytestring 'str' rather than
    474     # unicodes in Python 2; however, it can return unicodes for now.
    475     # See https://github.com/bartdag/py4j/issues/306 for more details.

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1036, in GatewayClient.send_command(self, command, retry, binary)
   1015 def send_command(self, command, retry=True, binary=False):
   1016     """Sends a command to the JVM. This method is not intended to be
   1017        called directly by Py4J users. It is usually called by
   1018        :class:`JavaMember` instances.
   (...)   1034      if `binary` is `True`.
   1035     """
-> 1036     connection = self._get_connection()
   1037     try:
   1038         response = connection.send_command(command)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:284, in JavaClient._get_connection(self)
    281     pass
    283 if connection is None or connection.socket is None:
--> 284     connection = self._create_new_connection()
    285 return connection

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:291, in JavaClient._create_new_connection(self)
    287 def _create_new_connection(self):
    288     connection = ClientServerConnection(
    289         self.java_parameters, self.python_parameters,
    290         self.gateway_property, self)
--> 291     connection.connect_to_java_server()
    292     self.set_thread_connection(connection)
    293     return connection

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:438, in ClientServerConnection.connect_to_java_server(self)
    435 if self.ssl_context:
    436     self.socket = self.ssl_context.wrap_socket(
    437         self.socket, server_hostname=self.java_address)
--> 438 self.socket.connect((self.java_address, self.java_port))
    439 self.stream = self.socket.makefile("rb")
    440 self.is_connected = True

ConnectionRefusedError: [Errno 111] Connection refused
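The traceback above ends in a `ConnectionRefusedError`: the py4j bridge can no longer reach the driver JVM, meaning the Spark process itself died while `TrainValidationSplit` was fitting the Random Forest grid in parallel — a failure pattern commonly caused by the driver running out of memory in local mode. A minimal sketch of a more conservative setup is below; the memory values, column names, and grid values are illustrative assumptions, not values taken from this notebook:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Give the local driver more headroom (illustrative values; adjust to the machine).
spark = (
    SparkSession.builder
    .appName("chicago-crimes-rf")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "64")  # fewer partitions for local mode
    .getOrCreate()
)

rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# A smaller grid means fewer models held in memory during tuning.
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

tvs_rf = TrainValidationSplit(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    trainRatio=0.8,
    parallelism=1,  # fit one candidate model at a time instead of in parallel
)
```

The key changes relative to the failed run are `parallelism=1` (the default, instead of fitting several grid candidates concurrently) and extra driver memory; the estimator and column names here are placeholders since the original cell's definitions are not visible in this output.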
[Second, identical ConnectionRefusedError traceback truncated: IPython's
exception formatter also tried to reach the dead Spark JVM via Py4J and
failed with the same "[Errno 111] Connection refused". The Spark driver
was no longer running, so the SparkSession had to be recreated and the
earlier cells re-executed before continuing.]
In [56]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
)

# Model
gbt = GBTClassifier(
    labelCol="label",
    featuresCol="features",
    maxIter=30,          # number of trees
    maxDepth=5,          # depth kept shallow in Spark
    stepSize=0.1         # learning rate
)

# Pipeline: 'indexers', 'encoders' and 'assembler' are the preprocessing
# stages built in earlier cells; they were lost when the Spark session
# died above, hence the NameError below. Re-run those cells first.
pipeline_gbt = Pipeline(stages=indexers + encoders + [assembler, gbt])

print("Treinando modelo GBT...")
gbt_model = pipeline_gbt.fit(train)

# Prediction
pred_gbt = gbt_model.transform(test)

# Evaluation
evaluator_auc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
evaluator_f1  = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

auc_gbt = evaluator_auc.evaluate(pred_gbt)
f1_gbt  = evaluator_f1.evaluate(pred_gbt)
acc_gbt = evaluator_acc.evaluate(pred_gbt)

print("=== GBT RESULTS ===")
print("AUC:     ", auc_gbt)
print("F1 score:", f1_gbt)
print("Accuracy:", acc_gbt)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[56], line 13
      4 gbt = GBTClassifier(
      5     labelCol="label",
      6     featuresCol="features",
   (...)      9     stepSize=0.1         # learning rate
     10 )
     12 # Pipeline
---> 13 pipeline_gbt = Pipeline(stages=indexers + encoders + [assembler, gbt])
     15 print("Treinando modelo GBT...")
     16 gbt_model = pipeline_gbt.fit(train)

NameError: name 'indexers' is not defined
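The `NameError` is only a symptom: after the connection failure above, the preprocessing stages from earlier cells no longer exist in the new session. Conceptually, `indexers` and `encoders` are `StringIndexer` and `OneHotEncoder` stages (one pair per categorical column) and `assembler` is a `VectorAssembler`. A minimal, Spark-free sketch of what the first two compute, on hypothetical data:

```python
from collections import Counter

# Pure-Python sketch (hypothetical data) of what the missing stages compute
# per categorical column: StringIndexer maps categories to indices by
# descending frequency; OneHotEncoder turns the index into a 0/1 vector
# (Spark drops the last category by default).

def string_index(values):
    order = [cat for cat, _ in Counter(values).most_common()]
    mapping = {cat: i for i, cat in enumerate(order)}
    return [mapping[v] for v in values], mapping

def one_hot(index, num_categories):
    vec = [0] * (num_categories - 1)   # dropLast=True: vector of size n - 1
    if index < num_categories - 1:
        vec[index] = 1
    return vec

crimes = ["THEFT", "BATTERY", "THEFT", "ROBBERY", "THEFT"]
indices, mapping = string_index(crimes)
encoded = [one_hot(i, len(mapping)) for i in indices]
print(mapping)   # THEFT, the most frequent category, gets index 0
print(encoded)
```

In the real pipeline each of these pairs is a fitted Spark stage, so rebuilding them just means re-running the cells that defined them.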
In [57]:
from pyspark.sql import functions as F

# use df_small or df_model (pick one of the two)
base = df_small  

area_features = (
    base
    .groupBy("community_area")
    .agg(
        F.count("*").alias("total_crimes"),
        F.avg("label").alias("arrest_rate"),
        F.avg(F.when(F.col("domestic") == "true", 1).otherwise(0)).alias("domestic_rate"),
        F.avg("hour").alias("avg_hour")
    )
)

area_features.show(10, truncate=False)
print("Nº de bairros agregados:", area_features.count())
                                                                                
+--------------+------------+-------------------+--------------------+------------------+
|community_area|total_crimes|arrest_rate        |domestic_rate       |avg_hour          |
+--------------+------------+-------------------+--------------------+------------------+
|31            |3496        |0.2640160183066362 |0.10411899313501144 |13.031178489702517|
|65            |2681        |0.21484520701230883|0.11301753077209996 |13.332711674748229|
|53            |5775        |0.23965367965367965|0.1986147186147186  |13.297316017316017|
|34            |1311        |0.2288329519450801 |0.08009153318077804 |13.273073989321128|
|28            |10610       |0.24052780395852968|0.07935909519321395 |12.90961357210179 |
|76            |1993        |0.2222779729051681 |0.061214249874560964|12.08078273958856 |
|27            |6650        |0.3924812030075188 |0.16421052631578947 |13.256691729323308|
|26            |6693        |0.43373673987748396|0.15598386373823397 |13.538024802031973|
|44            |7689        |0.2562101703732605 |0.18012745480556638 |13.040057224606581|
|12            |643         |0.09486780715396578|0.08709175738724728 |13.211508553654744|
+--------------+------------+-------------------+--------------------+------------------+
only showing top 10 rows

Nº de bairros agregados: 78
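A detail worth making explicit in the aggregation above: because `label` and the domestic flag are 0/1 indicators, their averages are proportions. A tiny pure-Python sketch with toy records (hypothetical values):

```python
# Toy records for one hypothetical community area: the mean of a 0/1
# indicator is exactly the proportion of 1s, which is why F.avg("label")
# yields the arrest rate and the F.when(...) average yields the domestic rate.
records = [
    {"label": 1, "domestic": "true",  "hour": 22},
    {"label": 0, "domestic": "false", "hour": 14},
    {"label": 0, "domestic": "false", "hour": 9},
    {"label": 1, "domestic": "true",  "hour": 23},
]
total_crimes = len(records)
arrest_rate = sum(r["label"] for r in records) / total_crimes
domestic_rate = sum(r["domestic"] == "true" for r in records) / total_crimes
avg_hour = sum(r["hour"] for r in records) / total_crimes
print(total_crimes, arrest_rate, domestic_rate, avg_hour)  # 4 0.5 0.5 17.0
```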
                                                                                
In [58]:
from pyspark.sql.functions import log1p

area_features = area_features.withColumn(
    "log_total_crimes",
    log1p(F.col("total_crimes"))
)
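The `log1p` transform is there because `total_crimes` is heavy-tailed (643 to 10,610 in the sample above); compressing it keeps the count feature from dominating the cluster distances. A quick check with values from the table:

```python
import math

# log1p(x) = ln(1 + x): compresses the heavy-tailed crime counts into a
# narrow range while preserving their order (counts taken from the sample).
counts = [643, 1311, 3496, 10610]
logged = [math.log1p(c) for c in counts]
print(max(counts) / min(counts))   # ~16.5x spread in raw counts
print(max(logged) / min(logged))   # ~1.4x spread after log1p
```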
In [59]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

# choose which columns go into the clustering
feature_cols = ["log_total_crimes", "arrest_rate", "domestic_rate", "avg_hour"]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features_raw"
)

assembled = assembler.transform(area_features)

scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features",
    withMean=True,
    withStd=True
)

scaled_data = scaler.fit(assembled).transform(assembled)
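The scaling step matters because `log_total_crimes` and `avg_hour` live on different scales than the rates; without it, K-Means distances are dominated by the largest feature. A pure-Python sketch of the z-score transform that `StandardScaler(withMean=True, withStd=True)` applies, with toy values loosely based on the table above:

```python
# z-score: (x - mean) / std per feature, so every feature contributes on
# a comparable scale to the Euclidean distances used by K-Means.

def zscore(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample std (n - 1)
    return [(x - mean) / var ** 0.5 for x in xs]

total_crimes = [3496.0, 2681.0, 5775.0, 10610.0]
arrest_rate = [0.264, 0.215, 0.240, 0.241]
print(zscore(total_crimes))  # now comparable in magnitude...
print(zscore(arrest_rate))   # ...to the scaled rates
```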
                                                                                
In [60]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(
    featuresCol="features",
    predictionCol="prediction",
    metricName="silhouette"
)

for k in [2, 3, 4, 5, 6]:
    kmeans = KMeans(k=k, seed=42, featuresCol="features")
    model_k = kmeans.fit(scaled_data)
    preds_k = model_k.transform(scaled_data)
    
    sil = evaluator.evaluate(preds_k)
    print(f"k = {k}  ->  Silhouette = {sil:.4f}")
k = 2  ->  Silhouette = 0.3591
k = 3  ->  Silhouette = 0.2230
k = 4  ->  Silhouette = 0.4233
k = 5  ->  Silhouette = 0.3192
k = 6  ->  Silhouette = 0.3374
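To make the metric concrete: the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster; values near 1 mean tight, well-separated clusters. A toy pure-Python version (Spark's ClusteringEvaluator uses a squared-Euclidean variant, so absolute values differ, but the interpretation is the same):

```python
def silhouette(points, labels):
    """Mean silhouette over all points, 1-D Euclidean distances (toy version)."""
    def mean_dist(p, members):
        return sum(abs(p - q) for q in members) / len(members)

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [q for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = mean_dist(p, own)              # cohesion: own cluster
        b = min(                           # separation: nearest other cluster
            mean_dist(p, [q for q, l in zip(points, labels) if l == other])
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

print(silhouette([1.0, 1.2, 9.0, 9.3], [0, 0, 1, 1]))  # well separated: near 1
print(silhouette([1.0, 5.0, 2.0, 6.0], [0, 0, 1, 1]))  # interleaved: negative
```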
                                                                                
In [65]:
from pyspark.sql.functions import col

# assuming this DF is the per-area result with the cluster column
# (adapt the name to the one you actually used)
clusters_area_sem_outlier = area_clusters.filter(col("community_area") != 0)
In [66]:
clusters_area_sem_outlier.groupBy("prediction").agg(
    F.count("*").alias("num_bairros"),
    F.avg("total_crimes").alias("media_crimes"),
    F.avg("arrest_rate").alias("media_arrest_rate"),
    F.avg("domestic_rate").alias("media_domestic_rate"),
    F.avg("avg_hour").alias("media_avg_hour")
).orderBy("prediction").show(truncate=False)
+----------+-----------+------------------+-------------------+-------------------+------------------+
|prediction|num_bairros|media_crimes      |media_arrest_rate  |media_domestic_rate|media_avg_hour    |
+----------+-----------+------------------+-------------------+-------------------+------------------+
|0         |36         |2080.1944444444443|0.18900757828887876|0.13223657077622575|13.072213710432358|
|1         |30         |7234.033333333334 |0.2911635801673923 |0.1542232692542025 |13.276360871827823|
|2         |11         |5625.545454545455 |0.20001915870740763|0.0860961990977164 |12.627060485980657|
+----------+-----------+------------------+-------------------+-------------------+------------------+

                                                                                
In [69]:
from pyspark.sql.functions import col, count, avg

crimes_area = crimes_df2.groupBy("community_area").agg(
    count("*").alias("total_crimes"),
    avg(col("arrest").cast("int")).alias("arrest_rate"),
    avg(col("domestic").cast("int")).alias("domestic_rate"),
    avg(col("hour")).alias("avg_hour")
)

crimes_area.show(5)
+--------------+------------+-------------------+-------------------+------------------+
|community_area|total_crimes|        arrest_rate|      domestic_rate|          avg_hour|
+--------------+------------+-------------------+-------------------+------------------+
|            31|       70496|0.25990127099409893|0.10858772128915116|13.016454834316841|
|            65|       52703|0.22070849856744398|0.11919625068781663|13.415308426465286|
|            53|      117229|0.24314802651221115| 0.1989951291915823|13.349981659828199|
|            34|       27148|0.22642551937527627|0.07499631648740239| 13.48276116104317|
|            28|      217006| 0.2418274149101868|0.08500225800208289|13.000184326700644|
+--------------+------------+-------------------+-------------------+------------------+
only showing top 5 rows

                                                                                
In [70]:
from pyspark.sql.functions import col
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler

# ----------------------------------------------------------
# 1. REMOVE THE OUTLIER (community_area = 0)
# ----------------------------------------------------------
data_no_outlier = crimes_area.filter(col("community_area") != 0)

# ----------------------------------------------------------
# 2. BUILD THE FEATURES FOR K-MEANS
# ----------------------------------------------------------
assembler = VectorAssembler(
    inputCols=["total_crimes", "arrest_rate", "domestic_rate", "avg_hour"],
    outputCol="features_raw"
)

assembled = assembler.transform(data_no_outlier)

# Standardization (substantially improves the clustering)
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
scaled_data = scaler.fit(assembled).transform(assembled)

# ----------------------------------------------------------
# 3. RUN K-MEANS WITHOUT THE OUTLIER
# ----------------------------------------------------------
best_k = 4  # k = 4 had the best silhouette above

kmeans_final = KMeans(k=best_k, seed=42, featuresCol="features")
kmeans_model = kmeans_final.fit(scaled_data)

area_clusters = kmeans_model.transform(scaled_data)

# display the clustered areas (output below)
area_clusters.select(
    "community_area", "total_crimes", "arrest_rate",
    "domestic_rate", "avg_hour", "prediction"
).orderBy("prediction", "community_area").show(80, truncate=False)
+--------------+------------+-------------------+--------------------+------------------+----------+
|community_area|total_crimes|arrest_rate        |domestic_rate       |avg_hour          |prediction|
+--------------+------------+-------------------+--------------------+------------------+----------+
|4             |51280       |0.17593603744149766|0.07868564742589704 |12.963845553822154|0         |
|5             |42353       |0.17873586286685714|0.05333742592024178 |13.041626331074541|0         |
|6             |144756      |0.1797922020503468 |0.04656801790599353 |12.32049103318688 |0         |
|7             |111392      |0.13280127836828498|0.0329018241884516  |12.847960356219478|0         |
|8             |252839      |0.25234635479494855|0.044589640047619235|12.661353667749042|0         |
|22            |148347      |0.19590554578117522|0.11105718349545322 |12.968553459119496|0         |
|24            |210238      |0.1772657654658054 |0.07992370551470239 |12.782203978348347|0         |
|28            |217006      |0.2418274149101868 |0.08500225800208289 |13.000184326700644|0         |
|56            |59169       |0.23860467474522132|0.10801264175497305 |12.884804542919435|0         |
|57            |25800       |0.18294573643410852|0.10484496124031008 |12.698527131782946|0         |
|76            |43640       |0.2442942254812099 |0.05524747937671861 |12.206118240146655|0         |
|77            |71823       |0.21203514194617323|0.08792448101583059 |13.001308772955738|0         |
|9             |7128        |0.11307519640852974|0.15404040404040403 |12.687149270482603|1         |
|10            |31280       |0.13945012787723784|0.11975703324808185 |12.727557544757033|1         |
|11            |28643       |0.16195929197360612|0.1388122752504975  |12.97950633662675 |1         |
|12            |13342       |0.10433218408034778|0.10283315844700944 |12.995952630789986|1         |
|14            |64187       |0.22630750775079067|0.1353233520806394  |13.109975540218425|1         |
|16            |80918       |0.18046664524580439|0.12904421760300552 |12.873847598803728|1         |
|17            |44302       |0.15493657171233804|0.1466750936752291  |13.118391946187531|1         |
|18            |17077       |0.14212098143702057|0.1661298822978275  |13.029571938865141|1         |
|19            |131404      |0.23468844175215367|0.15241545158442665 |13.010905299686463|1         |
|20            |43084       |0.2519728901680438 |0.1700631324853774  |13.156601058397548|1         |
|21            |66413       |0.1961664132022345 |0.1253670215168717  |13.01346121994188 |1         |
|36            |16513       |0.17961606007388117|0.21255980136861866 |13.228123296796463|1         |
|39            |41525       |0.17016255267910896|0.1717278747742324  |13.029163154726069|1         |
|41            |46175       |0.14163508391987006|0.10388738494856524 |13.005024363833243|1         |
|45            |36736       |0.18423344947735193|0.13937282229965156 |12.764890026132404|1         |
|47            |10731       |0.20669089553629671|0.16335849408256453 |12.98154878389712 |1         |
|48            |39197       |0.17391637115085337|0.14077607980202567 |12.780825063142588|1         |
|50            |29002       |0.22712226742983244|0.16988483552858424 |13.028135990621337|1         |
|51            |47279       |0.21641743691702447|0.1557351043803803  |13.12083588908395 |1         |
|52            |35408       |0.20210122006326253|0.1478197017623136  |13.036517171260732|1         |
|54            |32396       |0.22329917273737498|0.23515248796147672 |13.228299790097543|1         |
|55            |15840       |0.17241161616161615|0.15132575757575759 |12.999810606060606|1         |
|58            |69144       |0.2457769293069536 |0.13885514289020015 |12.953531759805623|1         |
|60            |45725       |0.18508474576271186|0.11737561509021323 |12.926867140513941|1         |
|62            |27501       |0.1845023817315734 |0.11886840478528053 |12.884549652739901|1         |
|63            |65274       |0.25089622207923523|0.13594999540398933 |12.892529950669486|1         |
|64            |28527       |0.15553685981701545|0.16692957548988677 |12.756581484207944|1         |
|70            |65081       |0.15626680598024   |0.14776970237089165 |12.950753676188135|1         |
|72            |26039       |0.12861477015246361|0.1197434617304812  |12.838434655708745|1         |
|73            |85436       |0.21907626761552507|0.17461023456154315 |13.079849243878458|1         |
|74            |16132       |0.15521943962310936|0.1462310934787999  |12.670654599553682|1         |
|75            |57187       |0.21347509049259447|0.1681675905363107  |13.19233392204522 |1         |
|23            |223982      |0.3750524595726442 |0.15371324481431545 |13.323400987579358|2         |
|25            |448276      |0.3751773460992781 |0.16956294782678527 |13.273121915962488|2         |
|26            |135522      |0.4350363778574696 |0.1521819335606027  |13.483353256297871|2         |
|27            |134276      |0.3833224105573595 |0.15910512675385027 |13.264001012839227|2         |
|29            |209901      |0.3711559258888714 |0.174210699329684   |13.253867299345883|2         |
|40            |75810       |0.2740535549399815 |0.17120432660598867 |13.181651497163962|2         |
|42            |114737      |0.2926344596773491 |0.1883001995868813  |13.422339785770937|2         |
|43            |236555      |0.22633214263067786|0.2010652913698717  |13.050609794762318|2         |
|44            |158031      |0.24979909005195183|0.1829767577247502  |13.091931329928938|2         |
|46            |132217      |0.24912076359318394|0.161484529220902   |13.112890172973218|2         |
|49            |190600      |0.2614323189926548 |0.17921301154249739 |13.193216159496327|2         |
|53            |117229      |0.24314802651221115|0.1989951291915823  |13.349981659828199|2         |
|61            |144352      |0.32352859676346707|0.14177150299268454 |13.353178341831079|2         |
|66            |174517      |0.26446707197579605|0.17548433676948377 |13.210758837247948|2         |
|67            |205117      |0.2889131568811946 |0.19373820794960925 |13.28728969319949 |2         |
|68            |187126      |0.27648750040079945|0.20367025426717827 |13.088630120881117|2         |
|69            |178267      |0.2581913646384356 |0.20433955807861243 |12.960901344612296|2         |
|71            |203100      |0.25450024618414574|0.19845396356474643 |13.033929098966027|2         |
|1             |110627      |0.26122013613313205|0.11367026132860875 |13.281522593941805|3         |
|2             |91678       |0.16165274111564387|0.10712493728048168 |13.323861777089379|3         |
|3             |105063      |0.2931288845740175 |0.08818518412761867 |13.421880205210208|3         |
|13            |24157       |0.17059237488098689|0.09260255826468518 |13.265016351368134|3         |
|15            |90729       |0.21881647543784236|0.14296421210417837 |13.258032161712352|3         |
|30            |120464      |0.2744886439102138 |0.13746015407092577 |13.259330588391553|3         |
|31            |70496       |0.25990127099409893|0.10858772128915116 |13.016454834316841|3         |
|32            |177732      |0.24362523349762563|0.03131118763081493 |13.406685346476717|3         |
|33            |55471       |0.30428512195561647|0.07829316219285753 |13.159398604676317|3         |
|34            |27148       |0.22642551937527627|0.07499631648740239 |13.48276116104317 |3         |
|35            |79828       |0.34574334819862707|0.10941023199879742 |13.2325625093952  |3         |
|37            |23785       |0.34559596384275804|0.11242379651040572 |13.376371662812698|3         |
|38            |99216       |0.2652596355426544 |0.13675213675213677 |13.21734397677794 |3         |
|59            |29131       |0.23397754968933437|0.10689643335278569 |13.20273934983351 |3         |
|65            |52703       |0.22070849856744398|0.11919625068781663 |13.415308426465286|3         |
+--------------+------------+-------------------+--------------------+------------------+----------+

                                                                                
In [73]:
from pyspark.sql.functions import countDistinct, avg, round as Fround

# Cluster summary
clusters_resumo = (
    area_clusters
    .groupBy("prediction")
    .agg(
        countDistinct("community_area").alias("num_bairros"),
        Fround(avg("total_crimes"), 2).alias("media_crimes"),
        Fround(avg("arrest_rate"), 3).alias("media_arrest_rate"),
        Fround(avg("domestic_rate"), 3).alias("media_domestic_rate"),
        Fround(avg("avg_hour"), 2).alias("media_avg_hour")
    )
    .orderBy("prediction")
)

clusters_resumo.show(truncate=False)
+----------+-----------+------------+-----------------+-------------------+--------------+
|prediction|num_bairros|media_crimes|media_arrest_rate|media_domestic_rate|media_avg_hour|
+----------+-----------+------------+-----------------+-------------------+--------------+
|0         |12         |114886.92   |0.201            |0.074              |12.78         |
|1         |32         |42644.56    |0.184            |0.149              |12.97         |
|2         |18         |181645.28   |0.3              |0.178              |13.22         |
|3         |15         |77215.2     |0.255            |0.104              |13.29         |
+----------+-----------+------------+-----------------+-------------------+--------------+

                                                                                
In [75]:
# Make sure community_area = 0 is gone (the outlier was removed before K-Means)
# If area_clusters was built from data_no_outlier, this is redundant
area_clusters_clean = area_clusters.filter("community_area != 0")

area_clusters_pd = (
    area_clusters_clean
        .select("community_area", "prediction")
        .dropDuplicates()   # one record per area
        .toPandas()
)
                                                                                
In [76]:
# Join on the community_area code
gdf_clusters = gdf.merge(
    area_clusters_pd,
    on="community_area",
    how="inner"   # keep only areas that have a cluster
)
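The join above can be sketched with plain pandas (toy data; `gdf` itself is a GeoDataFrame loaded elsewhere). One pitfall worth noting: the GeoDataFrame stores `community_area` as float (26.0 in the output below), so the cluster table's key must be made comparable:

```python
import pandas as pd

# Toy stand-ins: a geometry-less "gdf" and a cluster assignment table.
gdf_like = pd.DataFrame({"community_area": [23.0, 25.0, 26.0, 99.0]})
clusters = pd.DataFrame({"community_area": [23, 25, 26], "prediction": [2, 2, 2]})

# Cast the key to float so both sides match the GeoDataFrame's dtype.
clusters["community_area"] = clusters["community_area"].astype(float)

merged = gdf_like.merge(clusters, on="community_area", how="inner")
print(merged)  # area 99.0 is dropped: the inner join keeps only matched areas
```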
In [77]:
gdf_clusters[["community_area", "prediction"]].head()
Out[77]:
   community_area  prediction
0            26.0           2
1            27.0           2
2            25.0           2
3            23.0           2
4            29.0           2
In [78]:
fig, ax = plt.subplots(figsize=(8, 8))

gdf_clusters.plot(
    column="prediction",
    ax=ax,
    categorical=True,     # clusters are categories
    legend=True,
    edgecolor="black",
    linewidth=0.6
    # optionally pass a cmap, e.g. cmap="tab10"
)

ax.set_title("Clusters de Community Areas por perfil de crimes e prisões", fontsize=14)
ax.set_axis_off()

plt.tight_layout()
plt.savefig("chicago_clusters_crime_map.png", dpi=150, bbox_inches="tight")
plt.show()
[Figure: choropleth map of Chicago community areas colored by K-Means cluster, saved as chicago_clusters_crime_map.png]