Key Responsibilities:

1. Data Pipeline Development:
   o Build and maintain scalable, efficient data pipelines using Python and PySpark.
   o Write SQL queries to extract, transform, and load data from Amazon Athena.
   o Ingest large-scale data from various sources using AWS tools (e.g., S3, Redshift, Glue, Lambda, Athena), with an emphasis on ensuring data quality and integrity.
   o Implement data quality checks, validation rules, and error-handling mechanisms to ensure accurate data ingestion (see the first sketch after this list).

2. Data Exploration and Wrangling:
   o Utilize NLP techniques and regular expressions to explore, clean, and process unstructured datasets. Perform data wrangling to structure raw data for further analysis using Pandas or PySpark as appropriate.
   o Apply intermediate SQL techniques for data manipulation and aggregation, optimizing queries for large datasets.
   o Handle geospatial data and processes related to geolocation, geofencing, and entity deduplication (e.g., address normalization and deduplication of entities such as companies; see the second sketch after this list).

3. Exploratory Data Analysis (EDA):
   o Conduct exploratory data analysis using Python, PySpark, and SQL to discover patterns, trends, and anomalies within large datasets, particularly in the logistics and supply chain management domain (see the third sketch after this list).
   o Leverage knowledge of logistics, geolocation, and supply chain data to identify key insights and make data-driven recommendations.

4. Data Evaluation & Quality Assurance:
   o Identify and troubleshoot data quality issues related to geolocation, company addresses, logistics, and supply chain data, ensuring data consistency, accuracy, and reliability.

5. Communication & Documentation:
   o Reports: Communicate data findings and insights to both technical and non-technical stakeholders through detailed reports.
   o Task Tracking: Document project progress and tasks using Jira and other project management tools to ensure clear communication and status tracking across teams. Collaborate with cross-functional teams to ensure alignment on project objectives and deliverables.
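
To illustrate the ingestion-time validation described under item 1, here is a minimal PySpark sketch. The S3 paths, column names (shipment_id, origin_address, weight_kg), and validation rules are hypothetical, chosen only to show the pattern of splitting valid rows from rejects:

    # Minimal sketch of an ingestion-time data quality check in PySpark.
    # Bucket paths, columns, and rules below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shipment_ingestion").getOrCreate()

    # Read raw shipment records from S3 (hypothetical location and schema).
    raw = spark.read.parquet("s3://example-bucket/raw/shipments/")

    # Validation rules: required fields present, weights positive.
    valid = raw.filter(
        F.col("shipment_id").isNotNull()
        & F.col("origin_address").isNotNull()
        & (F.col("weight_kg") > 0)
    )
    rejected = raw.subtract(valid)

    # Route failures to a quarantine location for inspection instead of
    # silently dropping them, preserving data integrity downstream.
    rejected.write.mode("append").parquet("s3://example-bucket/quarantine/shipments/")
    valid.write.mode("append").parquet("s3://example-bucket/clean/shipments/")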
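The second sketch shows the kind of regex-based address normalization and company deduplication mentioned under item 2, using Pandas. The abbreviation table, sample records, and normalization key are assumptions for illustration, not a prescribed method:

    # Minimal sketch: normalize addresses with regular expressions, then
    # deduplicate company entities on a derived key. Patterns and column
    # names are illustrative assumptions.
    import re
    import pandas as pd

    ABBREVIATIONS = {r"\bst\b": "street", r"\brd\b": "road", r"\bave\b": "avenue"}

    def normalize_address(addr: str) -> str:
        # Lowercase, strip punctuation, collapse whitespace, expand abbreviations.
        addr = re.sub(r"[^\w\s]", " ", addr.lower())
        addr = re.sub(r"\s+", " ", addr).strip()
        for pattern, full in ABBREVIATIONS.items():
            addr = re.sub(pattern, full, addr)
        return addr

    companies = pd.DataFrame({
        "name": ["Acme Corp", "ACME Corp.", "Globex LLC"],
        "address": ["12 Main St.", "12 main street", "9 Harbor Rd"],
    })

    # Build a normalized key and keep one record per (name, address) entity.
    companies["key"] = (
        companies["name"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.strip()
        + "|" + companies["address"].map(normalize_address)
    )
    deduped = companies.drop_duplicates(subset="key")

Here "Acme Corp, 12 Main St." and "ACME Corp., 12 main street" collapse to the same key and are deduplicated; a production pipeline would typically add fuzzy matching on top of this exact-key pass.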
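The third sketch illustrates the EDA work under item 3: aggregating a logistics dataset in PySpark and flagging anomalous routes. The table name (logistics.deliveries), columns, and the three-standard-deviation threshold are all assumptions:

    # Minimal EDA sketch in PySpark: summarize transit times per route and
    # flag routes far above the fleet-wide average. Names are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("delivery_eda").getOrCreate()
    deliveries = spark.table("logistics.deliveries")  # hypothetical catalog table

    per_route = deliveries.groupBy("route_id").agg(
        F.count("*").alias("n"),
        F.avg("transit_hours").alias("avg_transit"),
        F.stddev("transit_hours").alias("sd_transit"),
    )

    # Flag routes more than 3 standard deviations above the overall mean.
    stats = per_route.agg(F.avg("avg_transit"), F.stddev("avg_transit")).first()
    overall_mean, overall_sd = stats[0], stats[1]
    anomalies = per_route.filter(F.col("avg_transit") > overall_mean + 3 * overall_sd)
    anomalies.show()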